Electronic circuits

ABSTRACT

A method of synchronising a circuit, comprising the steps of synchronising the circuit globally using a high-frequency clock signal, further synchronising at multiple lower frequencies by cooperative short-range state machines clocked by the high-frequency clock, and synchronising the state machines to each other by exchanging rollover signals between them.

The present invention relates to developments pertaining to the fields of endeavour of the applicant's own earlier International application no WO 01/89088, U.S. application Ser. No. 09/529,076 (national phase of PCT/GB00/00175), U.S. patent application Ser. No. 10/167,639 (divisional of U.S. Ser. No. 09/529,076), U.S. patent application Ser. No. 10/167,200 (continuation in part of U.S. Ser. No. 09/529,076), as well as that of International application no PCT/GB2002/005514, the disclosures of all of which are incorporated herein by reference.

Further explicitly incorporated herein are the contents of the UK patent application referenced hereinafter, the disclosure of which forms part of the present application and the inventions disclosed herein.

British Application No 0203605.1

The figures referenced below are those shown on sheets 1/53 to 17/53 of the drawings of the present application.

Hierarchical Clocking System.

Frequency Division/Pulse Latching/Adiabatic Systems

This scheme is designed to enable the Rotary Clocking Architecture to support legacy low-speed clock network topologies while allowing RTWO direct high-speed low-power clocking to be inserted for newly designed blocks.

It also assists in integrating SOC designs where multiple clock frequencies and clock phases are required.

Methods of achieving lower frequency-divided energy-efficient ‘adiabatic’ clocks from RTWO with special waveshape and phasing features are also described.

Note: Throughout the text, the assumption is made that there is either a control program built into the VLSI device or else off-chip hardware which is able to load and read the various shift registers and data registers, either serially or in parallel. Methods to do this are widely known and standard.

This application's background material is within patent application PCT/GB00/00175, which is hereby included complete by reference.

General Idea:

-   Distribute RTWO at overclock frequency. This clock (e.g. 10 GHz) provides anti-phase clock edges at each ½ cycle, e.g. every 50 pS for a 10 GHz clock (100 pS cycle). The full-speed clock is suitable for many applications directly (high speed ALU, SERDES I/O ports).
-   Centrally located FLL (Frequency-locked loop) to control the master ‘overclock’, preferable to a Phase-locked loop.
-   Features:
    -   Coarse control (Frequency division, digital)
    -   Medium control (Switched Capacitor, digital)
    -   Fine control (Varactor, analogue)
-   Advantages over PLL:
    -   Much more stable loop
    -   Lower power
    -   Lower area
    -   Higher speed
    -   Better stability (Jitter, Skew)
    -   Phase locking between multiple frequencies
-   Phase locking is provided by RTWO inherent phase lock mechanisms (2 types: junction locking (inter-chip) and delay-matched links (intra-chip)). This works on the principle that if frequencies are locked, phase locking is a simple matter of getting the “externally phase indifferent” rotating waves synchronised.
-   Use the ‘overclock’ to produce not just frequency-divided but arbitrary waveshapes, phase-aligned to the reference clock, for various applications.
    -   Legacy I/O clocks, e.g. Pulse clocks
    -   Low frequency clocks for Global (e.g. Cache, long range parallel busses)
-   Allow replacement for active “deskew” mechanism.
-   Digitally controlled advance/retard phasing. Eliminate cross-conduction current spikes.
-   Arbitrary repetitive waveform: High/Low periods, fractional N, possible.
-   Gives all features required of high-end processors including test clocks, etc.
-   Gives high-speed phase-locked peripheral clocks for SERDES (Serial/Deserial). Local high-speed clocking for ALU etc., from main clock.

Topology.

Previous descriptions of RTWO structures have extensively used distributed components such as back-to-back inverters, switched capacitors, varactors etc. located around the RTWO transmission-line path for frequency control, rotation direction bias etc.

In this application, these pieces are brought into a modular architecture alongside waveshape generation components in what we refer to as “Binary Waveshaping Blocks” (BWBs). The architecture makes RTWO fit into a wide range of current VLSI synchronous clocking methodologies used in industry today without any change in underlying methodology.

There are inherent advantages in using RTWO waves directly in a 2-phase non-overlapping latching style which are not fully realised by this approach, and it is anticipated that a mix of the pure RTWO clock for new components and hierarchical RTWO clocking will be the best compromise in a multi-frequency environment.

FIG. 1—Architecture.

A representative VLSI chip is shown with RTWO transmission-lines and inverters evident.

-   REFCLK input: will be used to get the on-chip RTWO system synchronised precisely to an external reference frequency supplied on this pin.
-   Phase lock “Synchronisation strap” point is shown on the left side. These have been described in a previous application and allow phase locking between RTWO chips by hard-locking. [The alternative method of PLL type alignment has not been dismissed as another solution.]

In the centre of the chip, two blocks are shown.

-   BWB0
    -   This is the primary “Binary Waveshaping Block” for the chip.
    -   It supplies the source of the Qn and *Qn multi-cycle synchronisation signals (see further below and FIG. 2).
-   FLL
    -   Frequency-locked Loop.
    -   This circuit ensures that the main RTWO operating frequency of the chip is closed-loop controlled to be exactly some multiple of the input REF_CLK, which could come from an external system standard, e.g. a Quartz Crystal. Essentially, if the RTWO frequency is higher than (REF_CLK × X) it is reduced by Varactor or Switched capacitor control until it is precisely locked in frequency. Detailed operation is described further below.
-   Absent: PLL
    -   In theory, frequency and phase can be controlled to an external reference using a PLL and Phase-Frequency comparator. In practice, there is so much uncertainty in phase on the REF_CLK, especially as it travels into and then across the chip, that it is useless as a phase reference. Phase locking between the RTWO chip and an external phase can be achieved with hard wire locking (described in previous applications) or by using implicit phasing information, e.g. by detecting the edges of an incoming NRZ data stream and adjusting the phase of the RTWO rings (via Varactor control) until the data is sampled synchronously. [TBD]

Multiple Global, Frequency-Divided Clocks:

The object of this architecture is to produce clocks related in frequency and phase to each other all around the chip. The main RTWO clocking array gives precise phase relationships between all points on the chip for 360 degrees of phase due to the pulse combination mechanism on the transmission-line (see JSSC paper).

Where multi-cycle events are to be synchronised (e.g. to generate a clock which is 1/10 of the main RTWO frequency), not only is a sequential state machine required to perform the sequencing over multiple cycles, but since this /N clock should be phase-aligned with other /N clocks on the chip, there has to be some global synchronisation signal to keep the states of the state machines in sync, so they all go through state 0 together.

An obvious method is to distribute a global ‘synch’ wire around the chip for every derived clock, but this wire would need to be designed to travel the entire chip with precise timing, with skew a fraction of the master RTWO clock cycle. This is just as difficult a problem as generating a conventional H-tree clock and is infeasible.

Instead, we propose to have each of the state machines in the BWB blocks signal to its neighbour when it has completed its sequence prior to looping. The signalling distance is therefore short. In effect, each BWB signals to its neighbour that it is about to ‘loop’ to state 0 in the next RTWO cycle (or ½ cycle), which the receiving BWB will take as a command to go to state 0 on its next RTWO clock edge, ensuring eventually that all BWB states come into sync across the chip.

(Power consumption due to this is low: the frequency is N× less than the RTWO frequency and the load capacitance is just a pair of receiver gates at each BWB.)

A drawback of this approach is that it takes N × (number of BWBs) RTWO clock cycles before the whole chip has its multi-cycle state machines synchronised.

To mitigate this, it is possible to “fan-out” from the primary BWB to drive say 4 near-neighbours from each BWB.

The upshot of all this logic is that there is a “Global”, i.e. chip-wide, sequence (or RTWO cycle) number available, which allows for logic which responds synchronously over the whole chip at rates lower than fRTWO.
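The neighbour-to-neighbour scheme described above can be sketched behaviourally. This is a minimal Python model under stated assumptions, not the circuit itself: the name `simulate_chain`, the integer state representation and the plain daisy chain (no fan-out) are all hypothetical choices made for illustration. Each BWB is modelled as an N-state sequencer that is forced to state 0 when its upstream neighbour is in its last state.

```python
def simulate_chain(n_states, start_states, max_cycles=10_000):
    """Return the number of clock edges needed for all sequencers to agree.

    start_states[0] is the primary BWB, which free-runs; every other BWB
    is reset to state 0 when its upstream neighbour signals 'last state'.
    """
    states = list(start_states)
    for cycle in range(max_cycles):
        if len(set(states)) == 1:
            return cycle
        # Rollover signals are sampled before the shared clock edge.
        last = [s == n_states - 1 for s in states]
        nxt = []
        for i, s in enumerate(states):
            if i > 0 and last[i - 1]:
                nxt.append(0)                     # commanded to state 0
            else:
                nxt.append((s + 1) % n_states)    # normal advance / rollover
        states = nxt
    raise RuntimeError("chain did not synchronise")


# Four BWBs with arbitrary power-up states and a /10 sequence.
cycles = simulate_chain(10, [3, 7, 1, 5])
print(cycles)
```

In this model the settling time stays within N × (number of BWBs) clock edges, which matches the drawback stated in the text; a fan-out topology would shorten it.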

BWB Circuitry Details:

Qn and *Qn outputs from the sequencer/state machine perform this function in FIG. 1, and can be seen on the insets daisy-chaining between BWB blocks.

Qn and *Qn are the true and complement of the last state of the loop within the Sequencer.

FIG. 2 shows waveforms of two possible sequencer state machines. The machine can be as simple as a /N counter with output logic to generate the last state (i.e. N−1), or could be a “One-Hot”, AKA “Moving Spot”, state machine where the last state is on an explicit output.

FIG. 2 a illustrates a /N counter with a “LASTin” input and a “LASTout” output, which allows it to be synchronised by previous /N counters in BWBs, and allows it to synchronise the next /N counter in the following BWB using its LASTout.

LASTout goes high on the count just before the /N counter returns to zero internally. LASTin is a registered input which, when high, forces the counter to go to count 0 on its next count.

Sequencing can be used to generate arbitrary waveforms. In the simplest case, a /N counter is a sequencer which gives a 0->1->0 output sequence when a total of N clock pulses are given to it.
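The /N counter with LASTin and LASTout can be illustrated with a small behavioural model. This is a sketch only: the class name `DivNCounter`, the `clk_out` decode (low for the first half of the count, high for the rest) and the choice to sample LASTin at the clock edge are assumptions, not details taken from FIG. 2 a.

```python
class DivNCounter:
    """Behavioural model of a /N counter with LASTin / LASTout sync pins."""

    def __init__(self, n, count=0):
        self.n = n
        self.count = count

    @property
    def last_out(self):
        # High on the count just before the counter returns to zero.
        return self.count == self.n - 1

    @property
    def clk_out(self):
        # Assumed output decode: 0 for the first half of the count, then 1.
        return 1 if self.count >= self.n // 2 else 0

    def clock(self, last_in=False):
        # LASTin is sampled at the clock edge; when high it forces count 0.
        if last_in:
            self.count = 0
        else:
            self.count = (self.count + 1) % self.n


upstream = DivNCounter(10, count=4)
downstream = DivNCounter(10, count=8)
for _ in range(10):
    sync = upstream.last_out        # sample before the shared clock edge
    upstream.clock()
    downstream.clock(last_in=sync)
```

After the upstream counter has rolled over once, both counters hold the same count and stay aligned on every later edge.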

Arbitrary Waveforming

A more general purpose clock waveform generator can be made using an N-state sequencer (“One-Hot encoder” or “Moving Spot”) coupled with gating and an output buffer.

This has a similar multi-cycle synchronisation system to the /N counter discussed previously: it uses *SYNC and SYNC inputs to receive a *Qn and Qn input from the previous stage and outputs its own *Qn and Qn to the next stage.

NOTE: Synchronisation is an N-clock synchronisation; there is still a within-cycle phase offset depending on the BWB block's location on the RTWO line.

FIG. 2 b shows a block diagram and timing sequence of a “Moving Spot” based sequencer. The Primary BWB (BWB0) is different from the other BWBs because it generates its own feedback from its output via a MUX.

Selection on the MUX allows variation of the length of the sequence programmatically if desired [when connected to an on-chip or off-chip microprocessor].

One method of making this Moving Spot register is with shift register elements. Another method is to use dedicated logic, such as shown in FIG. 3, illustrating a dual “Moving Spot” generator to get true and inverted One-hot encoding signals on outputs Q0 . . . Q9.5. This example gives a 20-bit sequence, and loads the RTWO lines A and B symmetrically. The state advances on each ½ cycle (i.e. rotation) of the RTWO clock signal. FIG. 4 shows the internal components of a single-bit “Moving Spot” element used to make up the FIG. 3 strips.

*SYNC and SYNC equate to the signals on the left side of the drawing; Qn and *Qn equate to the signals Q9.5 and *Q9.5 on the right.
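As an illustration of the one-hot sequence described above, the following Python sketch steps a 20-position “Moving Spot” once per RTWO half-cycle and labels the active output in half-cycle units. The function name `moving_spot` and the list-of-bits representation are assumptions made for this example; the real element is the logic of FIG. 3 and FIG. 4.

```python
def moving_spot(n_spots=20, steps=None):
    """Yield (label, one_hot_bits) for each half-cycle of the RTWO clock."""
    steps = n_spots if steps is None else steps
    for step in range(steps):
        pos = step % n_spots
        bits = [1 if i == pos else 0 for i in range(n_spots)]
        label = f"Q{pos / 2:g}"   # half-cycle units: Q0, Q0.5, ... Q9.5
        yield label, bits


for label, bits in moving_spot(20):
    print(label, "".join(str(b) for b in bits))
```

The last label produced in one pass is Q9.5, the output that the text identifies with Qn for a 20-bit sequence.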

Wave generators using the “Moving Spot” sequences are more flexible than /N counters.

An arbitrary waveform with high and low times defined digitally with a resolution of ½ RTWO clock period is available.

FIG. 5 gives a circuit which interfaces to the Moving Spot generator outputs to digitally set the “on” and “off” times of an output clock waveform (CLK_ARB) in terms of the high-resolution RTWO ½ period, via the buffer shown in FIG. 6.

A “1” in the SET register will turn on the CLK_ARB output at that point in the Moving Spot sequence. Similarly a “0” in the RESET register turns off the output at that time in the sequence. The CLK_ARB output can transition at most once per RTWO period and at minimum once per sequence, giving a frequency (two transitions per cycle) range down to FRTWO/10 for a 20-spot sequencer. The flexibility of the CLK_ARB comes from the programmability.

-   Frequency can be adjusted by setting the global sequence numbers where the state changes.
-   High time and low time can be set independently, which facilitates pulse-clocks.
-   Deskew: the programmable global sequence numbers of the commencement of the high period and low period can be programmed individually for each clock in the BWB. This effectively allows programmable de-skew to a resolution of ½ RTWO period (e.g. 50 pS @ 10 GHz RTWO frequency).
-   Gating: it is possible to gate the clock off.
-   Strobes and other specific, non-standard synch signals can be made and will be globally synchronous.

More than one CLK_ARB can be produced locally to each BWB; the SET and RESET and buffer circuitry have to be reproduced for each independent clock produced.
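The SET/RESET mechanism can be sketched as follows. This is a hypothetical software model, not the FIG. 5 circuit: the function name `clk_arb`, the register representation as Python lists and the rule that SET wins if both registers act on the same position are assumptions. The polarity follows the text as written (a 1 in SET turns the output on, a 0 in RESET turns it off).

```python
def clk_arb(set_bits, reset_bits, passes=1, initial=0):
    """Build CLK_ARB samples, one per RTWO half-period.

    set_bits[i] == 1   -> output goes high at sequence position i
    reset_bits[i] == 0 -> output goes low at sequence position i
    """
    n = len(set_bits)
    level, out = initial, []
    for step in range(n * passes):
        pos = step % n
        if set_bits[pos] == 1:
            level = 1
        elif reset_bits[pos] == 0:
            level = 0
        out.append(level)
    return out


# 20-spot sequence: rise at position 0, fall at position 4.
SET = [1] + [0] * 19
RESET = [1] * 4 + [0] + [1] * 15
wave = clk_arb(SET, RESET, passes=2)
print("".join(str(b) for b in wave))
```

With these register values the output is a pulse clock that is high for 4 half-periods out of every 20; moving the two marked positions changes high time, low time and phase independently.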

BWB sequences can be any length required, depending on the minimum frequency required. Not all BWBs need to have the same sequence length (an OR-gate can be used to pass out SYNCH pulses at the intermediate point when a 20-long sequencer is linked to a 10-long sequencer).

Using the BWB, a very close approximation to true single-phase clocking can be achieved at the reduced-frequency clock rates for legacy applications.

The arbitrary (reconstructed) waveform edges are synchronous to the local arrival of the RTWO wave. For a conventional, regular RTWO loop array, with 360 degrees requiring 2 rotation times of an edge on the RTWO (180 degrees per rotation), the highest level of non-synchronicity is between the furthest two points on a loop (diagonally opposite corners, half a rotation away from each other), i.e. 90 degrees out (¼ cycle) at the Foverclock. Nominating a single point on the RTWO to be “Phase angle Zero”, you find that by using either the *CLK or CLK line, any other point cannot be greater than ±90 degrees in phase error (e.g. moving from the +90 to the +95 degree point, you can use the other phase and this +95 degrees becomes −85 degrees).

At 10 GHz, this is ±25 pS, representing ±2.5% of a 1 GHz “virtual single-phase” clock, well within the 10% typical skew budget.
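The phase-error figures quoted above can be checked with a few lines of arithmetic. The helper name `fold_phase` is an assumption made for this sketch; it models the choice between the CLK and *CLK lines as a 180 degree shift that folds any tap position into the ±90 degree range.

```python
def fold_phase(deg):
    """Fold a phase angle into the +/-90 degree range by picking CLK or *CLK."""
    return (deg + 90.0) % 180.0 - 90.0


f_rtwo = 10e9                          # overclock frequency, Hz
period_ps = 1e12 / f_rtwo              # one RTWO cycle in picoseconds
worst_ps = period_ps * 90.0 / 360.0    # 90 degrees expressed as time
slow_period_ps = 1e12 / 1e9            # period of a 1 GHz derived clock
skew_pct = 100.0 * worst_ps / slow_period_ps

print(fold_phase(95.0), worst_ps, skew_pct)
```

A +95 degree point folds to −85 degrees, and 90 degrees of a 10 GHz cycle is 25 pS, which is 2.5% of a 1 GHz clock period.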

The error is stable and calculable and could be accounted for by adding time to the minimum delay to prevent any race conditions. The fact that the phase is known makes it much easier to deal with than jitter, which is a random variation of skew.

BWBs are synchronised to each other by an interwiring line from the Qn output of one stage feeding the *SYNC, SYNC inputs of the next stage in a daisy chain fashion.

Controlled clock gating and orderly shutdown involves de-asserting the Qn, *Qn from the primary BWB.

In a reverse process to the startup, the BWBs will stop in sequence (since their SYNCH pulses stop).

Alternatively, individual BWBs can have their sequence data changed, allowing new waveshapes, phasing and frequency changes to be implemented.

Speed changing involves loading new data into the SEQ.CTRL registers, which get updated prior to count #0 or any other suitable count code.

Array storage can hold different sequence data to be loaded in after each sequence (effectively lengthening the sequence).

BWBs and sequencers can also be used to make special clocks, e.g. handshaking signals, strobes etc.

Adiabatic Clock Generation—FIG. 7, FIG. 8 (Replaces FIG. 5 and FIG. 6)

RTWO signals are energy conserving, because electric (capacitive) and magnetic (inductive) energy is continuously re-used as a travelling wave travels around a closed path. RTWO loops tend to produce very high frequencies when applied on VLSI dimensions.

To support legacy interfaces and clock frequencies, frequency division (i.e. dividing a clock frequency to produce another, lower clock frequency) has been mentioned previously for RTWO.

Unfortunately, conventional frequency dividers and buffers like those just described are not adiabatic, i.e. they dissipate energy in driving load capacitance.

This section describes the principle of adiabatic frequency division. However, other options to slow RTWO are possible:

-   making higher inductance values to slow the line down
-   increasing load capacitance to slow the line
-   “wrapping” multiple loops of RTWO line around a region to extend the transmission-line length but maintain the perimeter.

The adiabatic frequency divider outlined here gives another ‘slow-down’ option.

In a pulse transmission-line system such as RTWO, line current charges the distributed capacitances for a forward-travelling ‘edge’. It is possible to steer these currents to charge and discharge other capacitances at frequencies synchronously related to the main loop frequency and thus generate low frequencies.

The RTWO line doesn't “know” the difference.

In practice this is difficult to achieve in an efficient manner on anything other than a very modern (0.18 µm or less) CMOS process.

Principle.

-   The principle used is the observation (looking at FIG. 8) that a 2-phase clock of frequency F can be split into (2*N) phases at frequency F/N.
-   A simple example is splitting a 2-phase 4 GHz clock into a 4-phase 2 GHz clock.

Table 1, Switches Operating During Sequence.

Count   Switches on during this cycle (initial transition)   *Optionally
0       A-J, B-L                                              *A-M, *B-K
0.5     A-M, B-K                                              *A-L, *B-J
1       A-L, B-J                                              *A-K, *B-M
1.5     A-K, B-M                                              *A-J, *B-L

Switches are controlled by the “One-Hot” state machine, similar to that described for the BWB units, but here just a 4-state machine.
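The switch sequence of Table 1 follows a regular rotation, which the short sketch below reproduces. It is illustrative only: the names `A_TARGETS`, `B_TARGETS` and `switch_table` are hypothetical, and the entries for count 1.5 are inferred from the pattern of the first three rows rather than read directly from the table.

```python
# Output node driven from RTWO line A and line B at counts 0, 0.5, 1, 1.5.
A_TARGETS = ["J", "M", "L", "K"]   # count 1.5 entry inferred from the pattern
B_TARGETS = ["L", "K", "J", "M"]   # count 1.5 entry inferred from the pattern


def switch_table(pre_activate=False):
    """Return [(count, [switch names])] for the 4-state one-hot machine."""
    rows = []
    for state in range(4):
        on = [f"A-{A_TARGETS[state]}", f"B-{B_TARGETS[state]}"]
        if pre_activate:
            # Optional: enable the next state's switches during the plateau.
            nxt = (state + 1) % 4
            on += [f"*A-{A_TARGETS[nxt]}", f"*B-{B_TARGETS[nxt]}"]
        rows.append((state / 2, on))
    return rows


for count, switches in switch_table(pre_activate=True):
    print(count, ", ".join(switches))
```

Each starred entry is simply the following row's unstarred switch, which is why the optional pre-activation needs only simple gating on the one-hot outputs.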

*Optionally, the transistors above can be activated in the previous steady state (plateau level) to allow for transistor turn-on time before the next edge occurs. This means transistors are turned on during a quiet time, with lower loss.

The unit labelled “Logic” incorporates simple gates to achieve the additional output gating required by the * items in the table above. Without this option, the outputs 0, 0.5 . . . 1.5 just directly drive one or more of the gates of the NMOS transistors for quadrature outputs.

There is no particular reason to adopt a quadrature signal sequence (left hand side of FIG. 8) and any sequence of any number of phases can be generated. The only limitation is that (ideally) every edge of the RTWO clocks should be switched into the same capacitance each time.

A useful version is the “One Hot” clocking scheme shown on the right of the timing diagram. The clock signals produced at J, K, L, M are able to drive capacitance adiabatically, i.e. not subject to CV^2F power, although I^2R power is lost in the ‘On’ resistance of the MOSFETs and the RTWO transmission-line conductors.

In theory, switching transistor gate capacitances can be adiabatically derived from any of the clocks, so this would not cause power wastage.

Effective Capacitance for the Main RTWO Line:

-   If the capacitive load on each of the /2 frequency output phases is C_slow (representing logic load capacitances), then the differential capacitance presented to the RTWO for the analysis of velocity and impedance is C_slow/2, because at any time the RTWO (differentially) is charging two of the capacitors in series. The RTWO line operates as normal, unaware of the ‘phase-splitting’ occurring at the adiabatic dividers (of which there can be any number located anywhere on the rings): it just seems to drive capacitance as normal.

The descriptions above consider the driving of locally capacitive loads.

Alternatively, or additionally, the clocks can drive other transmission-lines, e.g. to drive a “one-hot” pulse-clock to a remote location.

In effect, a J, K, L or M clock acts as a branch on the RTWO line energy, and impedance-matching is required for low-reflection energy flow (the same condition applies as for capacitance, i.e. the RTWO line should see the same impedance on each part of the sequence).

Recombination of Energy.

-   The multiphase frequency-divided clocks are inherently bidirectional and can pass energy between J, K, L, M and RTWO A, B in either direction.

Interestingly, the ‘remote end’ of the JKLM tap transmission-line could be recombined back into another location of the RTWO line using the JKLM phase point at another BWB. Globally, the sequence number is synchronous, and timing would be correct for the MOSFET switches to route the signal from either JKLM into the RTWO line. [Impedance matching and timing considerations apply.]

Another use of the J, K, L, M phasing scheme shown here would be to synchronise between two-phase F RTWO loops and 4-phase ½ F loops (two wraps around a perimeter, the alternative method). Energy could pass between them and synchronise them together.

Scan Test.

A Scan-Test block is shown within the BWB block diagram (FIG. 1 b). The standard JTAG boundary scan shift register system may be compatible with the proposed global serial data interface, permitting scan chain logic to share the same DAT in/out, SCLK bus as the other BWB components.

FLL—Frequency-Locked-Loop

To synchronise arrays of RTWO chips without a PLL and all its problems of jitter, bandwidth and area.

Only a single FLL controller is required per VLSI chip.

Previous applications described how passive transmission-line links between chips are able to synchronise same-frequency RTWOs on them together.

Weak (i.e. >>Zring) coherent links between chips will pull together two chips if the difference in frequency of the rings is small.

-   Getting the initial frequency difference small is the remaining issue.

Frequency Locking is One Good Method

Use a Frequency-locked loop, a very easy device to make from an up/down counter, or use a high precision charge pump circuit.

-   REF_CLK can come from an external low-frequency F reference; F_int can come from the RTWO clock /N.
-   Phase is unimportant, so edge rate, delays etc. don't matter: you don't try to control a phase, just F.
-   Control the RTWO frequency using switched caps or a varactor.
-   Use the INNERMOST (centrally shown in FIG. 1) RTWO ring (furthest away from the periphery where the frequency locking connections are) to measure and lock the RTWO frequency.

This ring will be more-or-less independent of the effects on frequency of non-synchronous signals injected into the remote rings.

-   With the innermost rings of multiple RTWO chips operating at identical frequencies, there is absolutely no preferred relative phase to the outside world (it is rotating after all). It is therefore easy to synchronise its phase with an imposed signal: it will lose energy from rotation until fully in synch, and the closer it is to synch, the less energy is lost.
-   Precautions: weak linkage is subject to slippage. RTWO has to be made very stable unless lots of linkages are present.

NOTE: the above only works at one frequency, determined by the off-chip transmission-line time. To fix this, external RTWO amp type devices can be used to trim those lines also, but it gets tricky to coordinate the whole thing.

FLL System Details

Two (of Many Possible) Methods. (1)

-   Dual charge pump: one pumping current in, the other pumping it out. Calibration: drive both pumps with the same clock, and trim until there is no output (needs a mux).
-   Up/Down counter.

Reference: “Phaselock Loops for DC Motor Speed Control”, Dana F. Geiger, Wiley, 1981, pp v, pp 77-92

Method 1

Charge Pump Frequency Controller. (Chargepump fcomp.ps) FIG. 9.

Purpose:

To lock the RTWO frequency to some multiple of an external reference frequency.

Compares two frequencies and outputs a control signal proportional to the difference between the frequencies to control a varactor (or switched capacitors) applied to the RTWO line to modulate the rotation time, hence frequency.

Not a Phase-Locked Loop

A /N counter is used to divide down the RTWO frequency to a lower frequency for matching to a low speed external reference F. Frequency comparison is done at low frequency to ease the distribution of the reference clock, which is difficult to control if it is a full-speed reference.

Inverters: IA, I1, IB, I2 are CMOS inverters (Pch/Nch), powered from supply VDD, 0 V.

Function:

-   each cycle of F1 frequency, a charge equal to C1*VDD is pumped to current mirror P1.
-   each cycle of F2 frequency, a charge equal to C2*VDD is pumped to current mirror P2.

When the frequencies are equal, the two currents above (charge*frequency) will be equal (for C1=C2).

In this case, the matched transistors P1, P2 will force zero current to the P2 drain, keeping voltage “VARACTORV” steady.

A mismatch in frequency causes a mismatch in P1, P2 currents, and “VARACTORV” will slew in a direction and magnitude proportional to the mismatch in frequencies.

This adjusts the varactor voltage, hence the RTWO frequency, to restore the RTWO frequency to that of a multiple of the low-speed reference clk.

This is an in-principle description, applicable to other charge-pump schemes known in the art.
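The charge-balance argument above (current = charge per cycle × frequency) can be put into numbers. The sketch below is an idealised model under stated assumptions: the function names, the component values (1 pF pump capacitors, 1.8 V supply, 10 pF on the control node) and the sign convention for the slew are all illustrative and are not taken from FIG. 9.

```python
def pump_current(c_farads, vdd_volts, freq_hz):
    """Average pump current: charge C*VDD delivered once per input cycle."""
    return c_farads * vdd_volts * freq_hz


def varactor_slew(f1_hz, f2_hz, c1=1e-12, c2=1e-12, vdd=1.8, c_node=10e-12):
    """Idealised slew rate of VARACTORV in V/s (sign convention assumed)."""
    i_net = pump_current(c1, vdd, f1_hz) - pump_current(c2, vdd, f2_hz)
    return i_net / c_node


print(varactor_slew(100e6, 100e6))   # matched frequencies: no net current
print(varactor_slew(101e6, 100e6))   # 1% mismatch: control voltage slews
```

With C1 = C2 the net current is zero only when F1 = F2, so the control voltage moves until the divided RTWO frequency equals the reference, independent of the phase of either input.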

Calibration is possible in the above circuit by routing the F1 and F2 inputs to the same REF clock using the MUX. In this condition, there should be no output drift of VARACTORV from the bias point of VDD/2 volts. CAL h and CAL l are inverters with modified thresholds which can be read by a state machine to determine if the frequency comparator is accurate. Self-trimming is possible by many means, e.g. changing (binary weighting) of the C1 or C2 capacitors using known switched-capacitor means, or by injecting a programmable offset current into either the P1 or P2 drain current. Accuracy of 0.1% can be expected and this is enough to allow for hard-wired phase locking over passive links for RTWOs (described in earlier patent applications).

Method 2

Digital Counter System. (counter_fcomp.ps) FIG. 10.

Reference: “Phaselock Loops for DC Motor Speed Control”, Dana F. Geiger, Wiley, 1981, pp v, pp 77-92

The reference cited above outlines a practical approach to DC motor speed control using a digital up/down counter to compare frequencies. The approach of controlling frequency as the primary loop variable gives a much more stable loop than Phase/Frequency detector systems, which have marginal stability.

The operation is straightforward. Design a binary counter which has an UP and a DOWN clock. The UP clock is fed from frequency F1, and the DOWN clock is fed from F2.

When the frequencies match, the counter gets net zero increment or decrement of its count value and alternates about the same value.

Addition of a DAC and a control loop (in this case Varactor control of the RTWO frequency) forces the counter to jitter around value 0.

An 8-bit counter using 2's complement notation gives signals of +127 to −128 which the DAC scales to an output current to drive VARACTORV directly or via an analogue integrator.

Varactor trimming can achieve +/−20% frequency variation, but a larger tuning range can be achieved with switched capacitors [See FIG. 16]. The addition of the digital comparator block and Counter2 can supplement varactor control when it alone is not sufficient to achieve frequency lock. The operation of Counter2 controls the Switched-Capacitor arrays distributed around the chip; its value is distributed to all BWB blocks using the shift register mechanism.

The design of the binary Comparators makes Counter2 increment or decrement whenever the error counter (Counter1) is out by more than 8 or −8 (chosen arbitrarily) respectively. This selects larger or smaller binary weighted capacitances added to the RTWO line to bring the frequency into a range where Varactor fine-tune control can fully close the loop.
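The two-counter arrangement can be sketched as a single update step. This is a behavioural illustration only: the function name `fll_step`, the dictionary state, and the saturation of Counter1 at the 8-bit two's complement limits are assumptions (the text does not say how overflow is handled). The ±8 threshold is the value given in the text.

```python
def fll_step(state, up_edges, down_edges, threshold=8):
    """Advance the error counter and the coarse counter by one interval.

    up_edges / down_edges: F1 and F2 clock edges seen in this interval.
    """
    c1 = state["counter1"] + up_edges - down_edges
    c1 = max(-128, min(127, c1))      # assumed saturation, 8-bit signed range
    c2 = state["counter2"]
    if c1 > threshold:
        c2 += 1                       # coarse step: switched-capacitor code
    elif c1 < -threshold:
        c2 -= 1
    return {"counter1": c1, "counter2": c2}


state = {"counter1": 0, "counter2": 0}
state = fll_step(state, up_edges=12, down_edges=2)    # F1 running fast
print(state)
state = fll_step(state, up_edges=5, down_edges=5)     # matched interval
print(state)
```

Counter1 would feed the DAC for fine varactor control, while Counter2 only moves when the error leaves the ±8 window, so the coarse capacitor code settles once the varactor can close the loop on its own.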

FIGS. 11 to 16 inclusive show component details of blocks referred to in passing in the main text (see below for descriptions).

-   File list.
-   TurboCad:
    -   hier0.tcw: main block diagram
    -   hier2.tcw: mechanism for digitally setting the “on” time and “off” time for arbitrary (non-adiabatic) clock generator (to feed to the buffer)
-   Xcircuit:
    -   adiab_1_sch.ps: components of adiabatic 4-phase generator (see also adiab_1.sda)
    -   buffer_block.ps: non-adiabatic CMOS buffer with individual inputs to control cross-conduction
    -   chargepump fcomp.ps: charge-pump frequency comparison method.
    -   counter_fcomp.ps: digital up/down counter method of frequency comparison.
    -   moving_spot_reg.ps: one method of making a “moving spot” register.
    -   spntmove elem.ps: expansion of the basic moving spot element
    -   XA.ps: switched-size inverter cell (digitally controlled).
    -   XB.ps: strobe cell (for automatic generation of strobe in absence of SCLK)
    -   XC.ps: shift register (single bit)
    -   XD.ps: latch cell (for holding shift-register values with Strobe).
    -   XE.ps: complete cell for digitally sized RTWO inverter cell (back-back)
    -   XF.ps: complete cell for digitally controlled Switched RTWO Capacitor
    -   XG.ps: switched capacitor (single bit).
-   Staroffice:
    -   adiab_1.sda: possible 4-phase clock signal sequences which can be generated adiabatically.
    -   fdiv_1.sda: picture of a /N counter block and a “Moving

British Application No 0214850.0

The figures referenced below are those shown on sheets 18/53 to 20/53 of the drawings of the present application.

High performance dynamic clocked logic family for use with Rotary Clocking or other adiabatic clock source.

Background material regarding Rotary clocking and RTWO, ROA is contained within patent application PCT/GB00/00175, which is hereby included complete by reference.

Background

Logic circuits on CMOS VLSI can be classed as either Static or Dynamic.

Static Logic:

Static logic gates are the norm. They use complementary devices: Nch devices to give logic 0 outputs, Pch devices to give logic 1 outputs. There is no requirement for a clock to perform the logic operation, but clocks ARE required for latches which capture and sequence the results of the logic operations.

-   FIG. 1 a shows a conventional static CMOS Nand gate [latches and clocks which are required elsewhere are not shown].

Dynamic Logic:

Dynamic circuits use only Nch devices in their evaluate paths and so are usually only able to output logic 0s. The logic 1 values are established by using a clock circuit to ‘precharge’ the output to 1, which initialises the output before the possible logic 0 output.

The advantage of using only Nch devices is that they have between 2× and 3× better electron mobility and so give lower input capacitance for a given switching drive ability.

Dynamic logic (or clocked logic as it is also known) has a long history.

Although largely displaced by CMOS (Pch & Nch) static logic, dynamic circuitry has a niche where maximum performance is the main requirement.

Many forms of dynamic logic have inherent storage and so often latches are not required in a dynamic logic system.

FIG. 1 b shows a conventional dynamic CMOS Nand gate whose output is precharged to VDD when CLK is low, and goes low only when CLK goes high and both logic inputs are also high (for the Nand function).
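The precharge/evaluate behaviour of the dynamic Nand gate can be captured in a few lines. This is a logic-level sketch only, with a hypothetical class name `DynamicNand`; it ignores charge leakage, charge sharing and timing, and simply models the output node as holding its value when no path conducts.

```python
class DynamicNand:
    """Logic-level model of a precharged (dynamic) 2-input Nand gate."""

    def __init__(self):
        self.out = 1

    def step(self, clk, a, b):
        if clk == 0:
            self.out = 1     # precharge phase: output pulled to VDD
        elif a and b:
            self.out = 0     # evaluate phase: both Nch devices conduct
        # otherwise the output node keeps its stored charge
        return self.out


gate = DynamicNand()
print(gate.step(0, 1, 1))    # precharge
print(gate.step(1, 1, 1))    # evaluate with both inputs high
print(gate.step(1, 0, 1))    # still in evaluate: node stays discharged
print(gate.step(0, 0, 1))    # next precharge restores the 1
```

The third call shows the inherent storage mentioned in the text: once discharged, the output stays low until the next precharge, whatever the inputs do.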

A further classification of logic circuits is adiabatic andnon-adiabatic.

Non-Adiabatic:

These are the norm, where the energy for logic evaluation and output comes from the power supply rails. Energy expended in charging the outputs and interconnect is wasted each time a logic transition occurs; effectively it is just like charging up a tiny battery and then discharging it with a short circuit each and every cycle. Power is related to C*V^2*F, and at GHz frequencies even a tiny capacitance causes massive power waste.
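As a rough illustration of the C*V^2*F relationship, the following Python sketch evaluates the switching power for one set of values. The capacitance, voltage and frequency used here are illustrative assumptions, not figures taken from this description.

```python
def switching_power_watts(capacitance_f, supply_v, frequency_hz):
    """Non-adiabatic switching power, taken as C * V^2 * F."""
    return capacitance_f * supply_v ** 2 * frequency_hz


# Assumed example values: 1 nF of total switched capacitance,
# a 1.5 V supply and a 4 GHz switching frequency.
example_power = switching_power_watts(1e-9, 1.5, 4e9)
print(f"{example_power:.2f} W")
```

With these assumed values the result is on the order of several watts, which is why even modest capacitance becomes significant at GHz rates.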

Adiabatic:

Energy for logic evaluation and output drive comes from a ‘reversible’ energy source, and the charging of the capacitances involved in logic switching is done progressively by a voltage source (e.g. a sine-wave clock) which is always close to the instantaneous voltage on the capacitance being charged or discharged.

The gradual, or adiabatic, charging results in recoverable energy transfer. Energy is just being moved around between the logic circuitry/interconnect and the clock.
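The benefit of gradual charging can be sketched numerically. The Python fragment below compares the energy dissipated when a capacitance C is charged abruptly from a fixed rail (0.5*C*V^2, independent of the switch resistance) with the energy dissipated when it is charged at constant current over a ramp time T through a resistance R, which works out to (R*C/T)*C*V^2. These are standard textbook expressions rather than formulas from this description, and the component values are assumptions chosen only for illustration.

```python
def step_charge_loss_j(c_f, v):
    """Energy lost in the switch when C is charged abruptly from a rail."""
    return 0.5 * c_f * v ** 2


def ramp_charge_loss_j(c_f, v, r_ohm, ramp_s):
    """Energy lost in R when C is charged at constant current over ramp_s.

    The current is I = C*V/T and the loss is I^2 * R * T.
    """
    current = c_f * v / ramp_s
    return current ** 2 * r_ohm * ramp_s


# Assumed values: 100 fF load, 1.5 V swing, 100 ohm path, 100 ps ramp.
step_loss = step_charge_loss_j(100e-15, 1.5)
ramp_loss = ramp_charge_loss_j(100e-15, 1.5, 100.0, 100e-12)
loss_ratio = ramp_loss / step_loss
print(f"ramp loss is {loss_ratio:.2f} of the abrupt-charging loss")
```

The ratio equals 2*R*C/T, so the loss falls as the path resistance is lowered or the ramp is made longer relative to R*C.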

FIG. 1 c is a potentially adiabatic logic gate because it is powered from an RTWO circuit, which is an adiabatic voltage/charge source/dump.

In principle, a Rotary Clock can power any known clock-powered logic circuit with greater speed and efficiency than sine wave or resonant circuits.

DESCRIPTION OF INVENTION

Dynamic, Adiabatic, Rotary-Clock Logic Family.

Rationale:

Dynamic logic is the highest performance logic technique; adiabatic logic has the lowest power consumption; Rotary Clock technology is the highest performance adiabatic timing signal generator.

Combining these three attributes should give the best possible power/performance of any synchronous logic system, and the rest of this description outlines such a logic family, which we are calling DARL (Dynamic, Adiabatic, Rotary-clock Logic family).

DARL logic circuits are sequenced and energised by Rotary Clock networks. Rotary Clocks have the unusual ability to drive considerable capacitance with a high frequency square wave without incurring C*V^2*F power consumption, due to an inherent recycling method.

DARL logic circuits extend this power-saving benefit to logic circuit evaluation and signal-interconnect capacitance driving. If this could be achieved in practice, there is the real possibility of eliminating most of the power consumption of a typical VLSI chip.

Losses are made up by the active circuitry on the RTWO lines, which compensates for both the clock and the data interconnect losses.

Circuit Description.

FIG. 2 And/Nand—Gate Followed by Buffer/Inverter.

The underlying concept of this logic family is that the Rotary clock energy is routed adiabatically to the output capacitance by Nch transistors based on a logical combination of input signals. One or other of the outputs transitions with the Rotary clock wire, giving a uniform capacitive loading as seen at the RTWO.

For a simple inverter/buffer, the CLK signal is routed to output Q if the inputs are logic 1, and routed to *Q if the inputs are logic 0.

True and Complement inputs and outputs are a feature of the logic family.

The main visible features of the circuitry for each gate are:

-   -   Input sampler or resistor
    -   Nch transistors with intrinsic gate capacitance
    -   Logic path 1
    -   Logic path 2
    -   Interconnect, or output capacitance.
    -   Optional extra storage capacitance on the inputs after the sampler.

In the case of a resistor in lieu of a sampler, the gate-drive capacitance is not being driven fully adiabatically. To recover the small energy here would need a derivative phase [e.g. a quadrature phase from a 4-phase RTWO]. It may not be worthwhile in practice since most of the load capacitance in modern chips is clock and interconnect capacitance.

Waveforms for DARL Buffer/Inverter [FIG. 3]

There are two phases of operation for each gate:

Sample/Evaluate (Logic Phase 1)

-   -   This state begins with CLK beginning its low-going edge.

Whichever logic path had previously propagated a “1” will now have its output returned to 0 because the logic path is still on (the new data has not yet been sampled), and so CLK is still connected to the output. Note that the output falls at the same rate as the clock since it is connected to it; this ensures adiabatic discharging.

-   -   During the CLK low plateau, both logic paths (1 & 2) sample the input signals from the previous stage, which is currently propagating its evaluation. This may alter the active logic path, but since the outputs will already be at logic 0, they cannot change. Charge stored on the gates of the Nch transistors represents the sample node. Additional capacitance could be added.
    -   For gates with more than one transistor in each logic path, each will sample, and the series or parallel path of the transistors constitutes a logic function. Only one or other of the logic paths can be active.
    -   The outputs Q and *Q will be at logic 0 (actively pulled to CLK voltage for one logic path, memory of 0 V for the other logic path).

Propagate (Logic Phase 2):

-   -   CLK going high represents the Propagate phase of the logic process.
    -   Where a sampler is used on the inputs, it is turned off at this point to prevent the previous logic stage from removing the sampled signal (possibly this switch off is done by CLK*CLK or by another phase point from the RTWO or by a logical combination of phase points to get an exact timing window; see illustrations).
    -   There will be an ohmic path from CLK to either Q or *Q depending on which logic path evaluated. This ohmic path is maintained by the charge on the gates of the Nch transistors.
    -   CLK going high therefore is coupled to either Q or *Q. The transition follows the RTWO clock line closely because it is connected to it through some resistance from the Nch transistors.
    -   Sizing of the Nch transistors is critical to making sure the charging/discharging is low-loss (adiabatic). Adiabatic charging/discharging is realised when there is very little phase lag between the RTWO clock and the output waveforms (low voltage over the resistance of the mosfets).

To create a logic pipeline, alternating CLK and *CLK powered gates are placed in series. There are no race conditions since one stage is sampling while the previous and next are propagating. Logically this is very much like a classic 2-phase latch style, which imposes its own well-known constraints on feedback paths.
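The two-phase sequencing described above can be sketched as a purely logical Python model. It ignores all analogue behaviour (waveshape, charge sharing, losses) and simply treats each stage as sampling while its clock is low and propagating while its clock is high. The class and function names are hypothetical and chosen only for this illustration.

```python
class DarlBuffer:
    """Logic-level model of one DARL buffer stage (no analogue detail)."""

    def __init__(self):
        self.sampled = 0  # stands in for the charge held on the Nch gates
        self.q = 0
        self.nq = 0

    def step(self, clk, data_in):
        if clk == 0:
            # Sample/Evaluate: both outputs are at 0 and the input is sampled.
            self.q, self.nq = 0, 0
            self.sampled = data_in
        else:
            # Propagate: the clock is steered to Q or *Q by the stored sample.
            self.q = self.sampled
            self.nq = 1 - self.sampled
        return self.q, self.nq


def run_two_stage_pipeline(bits):
    """Stage A is powered by CLK and stage B by *CLK; one bit per clock cycle."""
    stage_a = DarlBuffer()
    stage_b = DarlBuffer()
    observed = []
    for bit in list(bits) + [0]:  # one extra cycle flushes the final bit
        # CLK low, *CLK high: A samples the input while B propagates.
        stage_a.step(0, bit)
        q_b, _ = stage_b.step(1, stage_a.q)
        observed.append(q_b)
        # CLK high, *CLK low: A propagates while B samples A's output.
        q_a, _ = stage_a.step(1, bit)
        stage_b.step(0, q_a)
    return observed[1:]  # drop the start-up value seen before any data arrived


print(run_two_stage_pipeline([1, 0, 1, 1]))
```

In this model each stage only ever samples a neighbour that is in its propagate phase, which is the reason given in the text for the absence of race conditions.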

FIG. 2 illustrates this, showing how the preceding AND gate is driven from the opposite (typically) phase.

Phasing:

Rotary Clock is locally 2-phase with 360° “liquid” phase available globally. Advantage can be taken of the geographically variable phasing to improve timing. The 180 degree phasing in the simplest local case above is just an example. Sequentially connected DARL gates with less than or more than 180 degrees of phase separation on their clock sources can be useful, e.g. for time borrowing/stealing and for fractional-cycle offset synchronous repeaters.

Capacitances:

The Rotary Clock line sees a capacitance loading on each transition. Either the Q or the *Q output is transitioned. There are three balancing requirements for ideal performance (note that perfect matching is not required, but waveshape distortion is likely when mismatches are >10%).

Balancing Condition 1:

-   -   Interconnect capacitances on Q and *Q for each gate should be equal on a per-gate basis (by padding if needed) to keep constant capacitance seen from either CLK or *CLK depending on the gate.

Balancing Condition 2:

-   -   To operate differentially, CLK and *CLK should have matched capacitances. On average in any local area, the capacitances driven by CLK and those driven by *CLK should be matched.

Balancing Condition 3:

-   -   At the long-range and global levels, balancing and impedance matching (Kirchhoff type) is performed as documented for RTWO line balancing, since the logic appears as normal, fairly constant clock load capacitance.
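A layout check for the first two balancing conditions might look like the Python sketch below. The 10% limit is the figure mentioned in the text; defining mismatch relative to the larger of the two capacitances is an assumption made here, as are the function names and the example values.

```python
def relative_mismatch(c_a, c_b):
    """Mismatch as a fraction of the larger capacitance (assumed definition)."""
    return abs(c_a - c_b) / max(c_a, c_b)


def is_balanced(c_a, c_b, limit=0.10):
    """True when the pair is within the mismatch limit."""
    return relative_mismatch(c_a, c_b) <= limit


def padding_needed(c_q, c_nq):
    """Padding capacitance to add to (Q, *Q) so that both become equal."""
    target = max(c_q, c_nq)
    return target - c_q, target - c_nq


# Assumed per-gate interconnect capacitances: 10 fF on Q, 12 fF on *Q.
pad_q, pad_nq = padding_needed(10e-15, 12e-15)
print(is_balanced(10e-15, 12e-15), pad_q, pad_nq)
```

The same comparison can be applied to the summed CLK and *CLK loads in a local area for the second condition.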

The circuit just described is just one example of a circuit which steers rotary clock [or any uniflow transmission-line energy] selectively and in a balanced manner. The upshot is that the logic gates themselves, and the logic interconnect capacitance, become just another part of the rotary clock capacitance. Software such as Rotary-Expert (REX) can design a suitable layout. [PCT/GB2002/005514 incorporated herein by reference].

This principle extends to driving any capacitive load, and could certainly drive DRAM, SRAM or other memory decode lines in an adiabatic fashion.

RTWO Structures/Inductance Options.

Classic RTWO structures can be used with vias and multilayer interconnects to route down from the RTWO lines to the logic gating to provide the clocking. At higher frequencies, the vias themselves and the short-range interconnect become significantly inductive. It is then possible and sometimes important to treat these as part of the RTWO lines, or as RTWO lines in their own right, and move to the branch-and-combine flow matching algorithms during layout [re software patent] instead of just treating the logic gates as stub loadings on the main RTWO.

Sense Amps:

FIG. 2 also shows some cross-coupled Nch devices between the outputs and an option for a push-pull sense amplifier. These can help to enforce a differential potential difference in the presence of noise, and can give a return current path for capacitively coupled signal in the non-driven logic path output.

Further Refinements on this are:

-   -   Nch/Pch back-back inverter version (shown).
    -   Connecting common drain points to the opposite clock line instead of to supplies.

Device/Substrate Options:

An SOI process is an ideal vehicle to exploit this logic family because of the absence of body effect and of drain and source parasitics.

A bulk CMOS process will work OK. Where individual Pwells are available for the Nch devices, the Nch logic path transistors would benefit from being co-located in Pwell islands, each connected to the corresponding CLK or *CLK rotary clock signal associated with the logic gate.

Pmos devices are still required for the RTWO top-up function, unless a special all-Nmos bridge is used.

To cope with the ‘hot-gate’ voltages seen on gate nodes like GBA, the sampler transistors may have to be higher-voltage devices such as I/O transistors.

Applications—

-   -   Logic gates
    -   ALUs
    -   Memory decoders
    -   Synchronous repeaters: buffering using DARL buffers at known-phase points regenerates and retimes data transmissions.
    -   Any other digital circuit.

Advantages

-   -   Fastest speed: dynamic logic, all Nch in evaluate path.
    -   Two-phase logic: two evaluations per clock cycle. Differential (true/complement) outputs available. Fully pipelined.
    -   Clock powered: VDD/VSS connections not required. AC power, very few electromigration problems. No latches required.
    -   Lowest power: adiabatic, i.e. asymptotically zero power. Small area.
    -   No leakage current issues.
    -   Low skew, jitter, phase locking: Rotary Clock, RTWO, ROA advantages.
    -   Tiny data skew: data transitions are forced to align with the clock since the data is essentially the same signal as the clock.
    -   Forces the clock to be the same speed as the data flow.

Lightspeed: British Patent Application No. GB0218834.0

The figures referenced below are those shown on sheets 21/53 to 28/53 of the drawings of the present application.

High speed on-chip interconnect using ‘blip’ mode driver and multiphase locked rotary clock for signal generation and sampling timing.

A combination of a ‘blip-mode’ driver circuit, interconnect layout and RTWO synchronisation can achieve very high speed for on-chip data transfer, e.g. 10 mm in 70 pS flight time, and is very economic in terms of interconnect, active area and power consumption. Improvements are also possible to multi-phase operation and rotation locking.

Patent applications International WO 00/44093 and Hierarchical clock GB 0203605.1 are the background material included here by reference.

Note that throughout the text, reference is made to a 4-phase system. This is by way of an example, and 1-phase, 2-phase, 8-phase or any number of phases could be used as the basis of the circuitry. An RTWO clock generator is preferable, but other clock generators could conceivably be applied.

Background.

High speed synchronous signalling over long distances on chip is difficult in practice due to interconnect parasitics and clock skew/jitter. Possible solutions, e.g. use of wide, low-loss traces, PLLs, differential receivers etc., are usually too costly in chip area or metal usage to be used throughout a chip.

On-chip interconnect operates in either RC mode or LC mode of signal propagation depending on the resistivity of the wire and the rise/fall time of the sending signal [1].

Today, increasingly longer wires, higher operating frequencies and lower resistivity through copper interconnect have led to LC (transmission-line) mode behaviour being exhibited on-chip. Ringing and overshoot can occur on incorrectly terminated lines. The usual method of dealing with this involves breaking up long transmission lines into shorter segments (where LC effects are not seen) and inserting repeaters (CMOS inverters) in series with the line periodically. This drastically lowers the effective propagation speed due to inverter delay and furthermore makes the delay dependent on inverter characteristics. This latter problem causes data skews and jitter in synchronous busses, limiting the available operating frequency.

The option of using correctly designed transmission lines with terminations, although viable to 50 GHz [2], is seldom used due to power consumption problems and area constraints [most on-chip network circuits need PLL/DLL and differential receiver, transmitter etc].

This document outlines new circuits and an interconnect arrangement which can exploit LC behaviour at low power consumption by using a “blip” driver (meaning a driver with momentary pulse excitation of either +Ve or −Ve polarity) together with pseudo-differential signalling and detection from a self-biased inverter receiver.

Circuit/Interconnect Description.

FIG. 1 a shows the cross section of the proposed on-chip interconnect topology, configured here to create a multi-bit signal path. Each signal is sandwiched between a power (VDD) and a ground (VSS) line to form a coaxial transmission line to transfer an electrical signal from point TX to RX. On CMOS with SiO2 dielectric, the velocity is 0.5 c, which equates to approximately 7 pS per mm. Perpendicular routing patterns underneath can be combined at corresponding VDD, VSS points to form a power grid. Signal paths can also change layers and therefore direction. The layout is not limited to orthogonal routing and would also work with 45 degree layout rules.
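The flight-time figures follow directly from the stated velocity of 0.5 c. The short Python sketch below reproduces the arithmetic; the function name and the list of lengths are chosen here only for illustration.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458.0


def flight_time_ps(length_mm, velocity_fraction=0.5):
    """Time of flight in picoseconds for a line of the given length."""
    velocity = velocity_fraction * SPEED_OF_LIGHT_M_PER_S
    return (length_mm * 1e-3) / velocity * 1e12


for length_mm in (1.0, 4.0, 10.0):
    print(f"{length_mm:4.1f} mm: {flight_time_ps(length_mm):5.1f} ps")
```

This gives roughly 6.7 ps for 1 mm and roughly 67 ps for 10 mm, consistent with the rounded figures of 7 pS per mm and 70 pS for 10 mm quoted in this description.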

FIG. 1 b is the circuit diagram of a transmitter driver/receiver amplifier/bias. Typical values are:

-   Transmission-lines
    -   Length 4 mm
    -   Metal type: Aluminium/Copper, Thickness 1 micron
    -   Line width: signal 1 micron, power 2 micron
    -   Impedance ˜50 ohm
-   Transistor widths (all 0.18 u CMOS, gate length=0.18 u):
    -   N1 20 u N2 20 u N3 20 u
    -   P1 50 u P2 50 u P3 50 u
-   Resistors
    -   RFB 400 ohms.

Total supply current for TX and RX is 2.2 mA when active at a 1.5 V supply and 4 Gbps.

(This compares to Cinterconnect*V*F/2=2 mA, the equivalent current of driving just the capacitance with a full-height NRZ signal.)
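The interconnect capacitance behind the 2 mA comparison figure is not stated in the text. As a hedged cross-check, the Python sketch below back-computes the capacitance implied by the Cinterconnect*V*F/2 expression, assuming V = 1.5 V and F = 4 GHz from the typical values above; the resulting capacitance is therefore an inference, not a value given in this description.

```python
def nrz_equivalent_current_a(capacitance_f, supply_v, frequency_hz):
    """Equivalent supply current C * V * F / 2 for a full-height NRZ signal."""
    return capacitance_f * supply_v * frequency_hz / 2.0


def implied_capacitance_f(current_a, supply_v, frequency_hz):
    """Capacitance that would give the stated equivalent current."""
    return 2.0 * current_a / (supply_v * frequency_hz)


# Assumed inputs: 2 mA comparison current, 1.5 V supply, F taken as 4e9.
c_line = implied_capacitance_f(2e-3, 1.5, 4e9)
print(f"implied line capacitance: {c_line * 1e12:.3f} pF")
print(f"per mm over a 4 mm line:  {c_line * 1e12 / 4.0:.3f} pF/mm")
```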

In operation, a data stream controlled by local clock signals at the transmitter location pulses either the _send1 or the send0 signal. A current-limited pulse flows through either N1 or P1 down the line at the speed of light for the medium (eR=3.9 for SiO2, Vp=c/root(3.9)).

FIG. 2 a gives simulated Spice results for the circuit operating at 4 GHz with drivers driven during a one-phase period of a 4-phase clock.

Some details to note:

-   1. Termination impedance is a combination of 1/transconductance of N2,P2+RFB and will probably be higher than the line impedance. Higher than expected received signals are achieved, but reflections are not a problem due to the lossy nature of the line (almost no energy sent at TX will get back; see below).
-   2. Resistance of the signal conductor may be up to 5× the impedance, and so the line is very lossy and dispersive.
-   3. Two modes are operational: (1) LC transmission-line mode and (2) a slower mode where the effective termination impedance of N2,P2,RFB works with the total capacitance of the TX to RX line, forming a highpass filter.
-   4. The duration of the “blip” can be much less than the total clock cycle time.

The highest wiring density is achieved through using the smallest width possible on the signal and screen wires. Using the smallest width possible while still giving transmission-line type high velocities [1] results in sizing the cross-section to exhibit a resistance of approximately 2× to 4× the impedance (Z0) of the line. Ordinarily this kind of attenuation is difficult to cope with because, for the usual NRZ encoding, the received amplitude is very data-pattern dependent and not easily detected.

Using short-duration ‘blips’ serves two purposes. (1) It saves power, because the driver is only active for a short part of a clock cycle. (2) It fixes the problem of the lossy interconnect medium attenuating the pulse and spreading it out in time, because the effective termination resistance of the self-bias receiver restores the mid-supply bias with RC action in time for the next pulse to come down the wire.

The key point is that each new pulse is received free of remnants of the last pulse, and therefore the receiver can be made sensitive, in this case using a 2-stage amplification involving the secondary inverter N3,P3.

Contrast this with any kind of NRZ signal format, which on a path suffering this much attenuation would need special precompensation methods to avoid pattern-dependent DC drift in the receive amplifier.

[Another option realisable with the same driver circuits is Manchester encoding, but this would suffer a power consumption cost]

VDD and VSS wires are used to shield the signal line, which is centrally located between the VDD and VSS wires and so exhibits very little magnetic or capacitive signal injection for the expected differential-mode surges on the supply lines.

Additionally, careful selection of the ratio of the width of the power lines to the width and spacing of the signal wire can result in cancellation of coupled magnetic noise from one signal line to the next.

Finally, the N/P ratio of the N2,P2 receiver circuit is chosen for a self-bias voltage of approximately 0.5×VDD. This eliminates signal amplification of differential swings on the supply voltage at the receiver end.

In total, the circuit is very noise immune for the following reasons.

-   -   Normal differential supply noise does not affect the received signal.
    -   Coax construction shields the signal wire.
    -   Termination (self-bias) forms a highpass filter with the signal line, rejecting lower frequency noise from the supplies and from signal couplings.

VDD, VSS wiring is not wasted and works to supply power around the chip. Interestingly, the mutual capacitance they share with the signal line aids in decoupling the power supply.

Importantly, the line can serve as a true bus, not just a point-to-point data link. Signals can be tapped anywhere along the line; FIG. 2 b plots the signals at various points along the transmission line. Each tap point can drive a circuit similar to N2,P2,N3,P3 but either (1) without RfB, since only the far end needs the self-bias circuitry, or (2) using an RfB of higher value at each detector to distribute bias along the length. With the high resistance signal wire, mismatches of inverter bias voltage could be tolerated. AC coupling of the intermediate detectors is also practical.

Data at different tap points will be phase delayed, so the best places to tap into the data lines are the points where they cross over the RTWO lines. Here, the best phase (1-of-4 or however many phases exist) can be used to sample and synchronise the data.

FIG. 1 c is the equivalent electrical circuit (discounting resistance which is in the wires) illustrating the L, C and couplings which exist.

“Blips” are generated using either a monostable circuit triggered from one edge of the local clock or one phase of a 4-phase rotary clock sequence [see FIG. 3, and FIG. 6 for a 4-phase layout of RTWO in a grid].

Clocking

It is assumed that the chip will be equipped with RTWO clock structures to give a distributed phase-locked clock available at all points of the chip.

Multiphase clocking (beyond 2) involves making multiple wraps of differential wiring before inserting a net crossover in the signal path to form a single unbroken wire. FIGS. 6 and 7 show possible 4-phase RTWO structures arranged on a grid basis.

FIG. 5 shows a set of circuits which can be attached to the 4-conductor transmission line mentioned above at any cross-section point to power and sustain rotation. The conditional inverters CI0 . . . CI3 illustrated eliminate cross-conduction current. Small normal inverters between 180 degree points can be added to initiate start-up and, together with CI0 . . . CI3, will work to ensure that only one direction of rotation exists, as determined by the desired ph0 . . . ph3 sequence, which has to be matched to the ‘winding’ direction of the RTWO double loop. The alternate sequence of CCW rotation would be possible either by (1) changing the inputs to CI0 . . . CI3 around or (2) reconnecting the 4-phase grid connection points to reverse the rotation direction in the obvious manner.

Signal Serialising

Links can send non-serialised data bits at the rate of the RTWO frequency [as described in the data transfer application, number??? (divisional)].

Another option is to serialise data at full rate relative to a lower frequency clock which drives the local logic (as might exist on a 500 MHz ASIC driven by a /8 counter from a 4 GHz RTWO; in this case, 8 data bits could be sent per ASIC clock cycle on a single wire).

Clock source: a 4-phase RTWO oscillator provides the transmit clocks.

PhJ,K,L,M are each chosen from one of ph0 . . . 3. PhK and PhL should be 90 degrees apart because, when these are ‘AND’ed, they set one ¼ of a cycle period for the output ‘blip’ duration.
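The quarter-cycle figure can be checked with a small Python sketch that ANDs two ideal square-wave phases. It assumes 50% duty-cycle clocks with instantaneous edges, which is an idealisation; the function names are hypothetical.

```python
def phase_is_high(phase_deg, t_deg):
    """Ideal 50% duty square wave: high for 180 degrees starting at phase_deg."""
    return (t_deg - phase_deg) % 360 < 180


def and_overlap_fraction(phase_k_deg, phase_l_deg):
    """Fraction of a cycle during which both phases are high."""
    high_count = sum(
        1
        for t_deg in range(360)
        if phase_is_high(phase_k_deg, t_deg) and phase_is_high(phase_l_deg, t_deg)
    )
    return high_count / 360


print(and_overlap_fraction(0, 90))
```

Phases 90 degrees apart overlap for a quarter of the cycle, whereas phases 180 degrees apart never overlap and identical phases overlap for half the cycle.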

FIG. 8 is a possible 4-phase layout according to [Hierarchical???? patent number].

Transition Signalling:

Power can be saved using transition signalling, i.e. only activating either N or P when the data changes. A ‘0’-going event would generate the +Ve blip, a ‘1’-going event a −Ve blip. A static stream of 0's or 1's from the TX shift register would not cause any signalling event, and the receiver retains its last state by hysteresis.

The TX circuit of FIG. 3 achieves this by comparing the new data bit (Q0) with the last data bit (Q-1), generating no pulse when the data remains the same. [Q-1 is an extra stage on the shift register to store the last data bit transmitted]. The TX register is clocked at the full RTWO clock rate and is loaded in parallel fashion at a clock some divisor of the main clock (via the /n counter).

The RX circuit needs just a little hysteresis in these cases to maintain the previous switched state in the absence of new pulses at each bit time; Rfb2 can provide this hysteresis.
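The transition signalling scheme can be sketched at the logic level in Python. The encoder compares each new bit with the last bit sent and emits a blip only on a change; the decoder holds its state when no blip arrives, standing in for the receiver hysteresis. The polarity mapping follows the wording of the text, and the function names, the numeric blip encoding (+1, -1, 0) and the initial state of 0 are assumptions made for this illustration.

```python
def encode_transitions(bits, last_bit=0):
    """Return one entry per bit time: +1 (+Ve blip), -1 (-Ve blip) or 0 (no pulse)."""
    blips = []
    for bit in bits:
        if bit == last_bit:
            blips.append(0)   # data unchanged: neither N nor P is activated
        elif bit == 0:
            blips.append(+1)  # '0'-going event: +Ve blip, as worded in the text
        else:
            blips.append(-1)  # '1'-going event: -Ve blip
        last_bit = bit
    return blips


def decode_transitions(blips, state=0):
    """Recover the bit stream; the state is held when no blip arrives."""
    bits = []
    for blip in blips:
        if blip > 0:
            state = 0
        elif blip < 0:
            state = 1
        bits.append(state)
    return bits


sent = [1, 1, 0, 0, 0, 1]
line = encode_transitions(sent)
print(line, decode_transitions(line))
```

Because a blip is only sent on a change of data, two consecutive non-zero blips of the same polarity never occur in this encoding.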

A fourth possible special signal state exists, that is, sending two or more consecutive blips of the same polarity [the transition signalling will never send this sequence]. It could be used to indicate condition codes, e.g. strobes, if the receiver is designed to recognise it (this is not shown on any diagrams but would involve modifying the logic at Q0, Q-1 which drives _send1, send0).

An alternative approach could be to signal with unipolar pulses (just N1 firing) but with a modified threshold of the N3,P3 pair to output a default ‘1’ until an incoming −Ve blip sets Q to 0.

Signal De-Serialise.

The signal lines are routed on chip to the destination point, at which there is another local RTWO clock which will be phase locked to the TX RTWO clocks by virtue of hard-wired or other couplings between the rings (see FIG. 4 and FIG. 7).

The choice of phasing is designed to time the data sampling of the RX signal with the exact arrival time of the incoming data pulse and to account for receiver amplifier delay. A locally 4-phase RTWO tap gives 90 degree choices. Higher resolution can be gained by ‘sliding’ the sampling point to coincide exactly with a selected any-phase point [as described in the data transfer application, number???].

Deserialiser:

Data from the Q output of N3/P3 is sampled using N4,N5 gated by the overlap of two RTWO clock phases PhX,PhY chosen from two 90-degree separated phases from ph0 . . . 3 (4-phase system). For a 2-phase system, one transistor operating off one of the phases would work.

Sampled data is clocked into the local shift register to produce a parallel output every n cycles, where n is the divide ratio of the /n counter.
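The shift-register behaviour of the deserialiser can be sketched in Python as follows. The sketch assumes the first bit received is placed first in each parallel word and that any incomplete final word is simply not output; neither detail is specified in the text, and the function name is hypothetical.

```python
def deserialise(samples, n):
    """Group sampled bits into parallel words of n bits, one word every n cycles."""
    words = []
    shift_register = []
    for bit in samples:
        shift_register.append(bit)   # one sample per full-rate clock cycle
        if len(shift_register) == n:
            words.append(tuple(shift_register))  # parallel output on the /n clock
            shift_register = []
    return words


# Example with n = 8, as for a /8 counter.
stream = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0]
print(deserialise(stream, 8))
```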

REFERENCES

-   [1] Alena Deutsch, et al, “Modeling and characterization of long on-chip interconnections for high-performance microprocessors” IBM J. RES. DEVELOP. VOL 39, No 5, September 1995 pp 547-567 (p 549)
-   [2] Bendik Kleveland, Thomas H. Lee, and S. Simon Wong “50-GHz Interconnect Design in Standard Silicon Technology” IEEE MTT-S International Microwave Symposium, Baltimore, Md. Jun. 7-12, 1998 web: http://smirc.stanford.edu/papers/mtts98p-bendik.pdf

Piped Buffer: British Application No 0225814.3

The figures referenced below are those shown on sheets 29/53 to 31/53 of the drawings of the present application.

High temporal accuracy, high power, multistage pipelined CMOS buffer.

Patent applications PCT/GB00/00175 and GB 0203605.1 are hereby included by reference.

Background

VLSI CMOS logic devices frequently employ buffers (current amplifiers) in order to allow control signals to quickly drive capacitive loads such as those resulting from interconnect or transistor capacitance.

Traditionally, a chain of CMOS inverters with progressively larger stages will be cascaded to form an effective buffer between a low-drive signal and a highly capacitive load such as a clock load. More stages give a more powerful output and faster transitions (rise/fall times) but result in increased propagation delay between an input transition and the output transition. Furthermore, this delay time is not constant but depends on CMOS Process/Temperature and supply Voltage (PVT) variations.

Variations act to modulate the delay time of any buffer; for example, a 10% supply voltage variation can produce a 10% delay time variation in the buffer.

In applications such as clock distribution, the temporal accuracy of the signals is vital. For clock system categorisation, delay time is termed Skew and delay time variation is termed Jitter.

FIG. 1 shows the usual construction of a standard CMOS multistage inverting buffer.

Until recently, lithographic scaling of CMOS has produced increasingly beneficial performance from buffers. At each generation, the process shrink produces faster transistors, which would imply lowered skew, but now the transistor variations, e.g. length variation on devices with gate lengths of 0.13 u or below, can produce buffers with delay times which are badly mismatched with respect to each other even on the same die. Another issue with device scaling is reduced supply voltage and higher supply currents, which lead to power supply noise, which impacts directly on jitter through delay modulation.

For clocking applications, where buffers are placed all over a chip and it is critical to match delay times [the exact delay doesn't really matter], buffering becomes problematic, and it has been reported that as much as +/−1000 pS uncertainty can result.

Besides delay variations, the common buffer exhibits two more undesirable traits.

-   Excessive input capacitance.
    -   Each stage has a P and an N transistor with typical total capacitance of 2.5+1=3.5 relative units. For any transition of the buffer, all this capacitance must be charged to the other polarity. This slows down the buffer performance because each stage must charge one transistor off and charge the other transistor on before the next stage is active.
-   Shoot-through, or cross-conduction spikes.
    -   Each Pch/Nch inverter stage exhibits a direct current path between S-D of the Pch then D-S of the Nch when the input voltage is in transition.
    -   Up to 10% of clock power is wasted by simultaneous conduction during the transition periods.

Problem List of CMOS Buffers.

To summarise, the standard CMOS buffer exhibits the following negative attributes:

-   -   1. Excessive delay time of the long inverter chains required (up to 20 distributed stages in clock distribution applications produced by CTS [clock tree synthesis tool]).
    -   2. Delay variation (skew) due to deep-submicron process control problems.
    -   3. Jitter introduced by supply voltage noise modulating the already excessive delays.
    -   4. Excessive power consumption (well above Cload*V^2*F) arising from excessive buffer sizing to achieve acceptable delays.

The effects of items 1. and 2. can be largely offset by use of feedback techniques such as PLL (phase-lock loop) and DLL (delay-lock loop), but these will increase problems 3. and 4. and also impact on chip area.

Pipelined Approach to Buffering of Clock Signal.

To reduce problems 1, 2, 3 above, a buffer should be made to have the smallest delay possible. This would suggest the lowest number of stages in a chain, ideally just one stage. However, this is not feasible since the circuit driving the buffer usually provides a weak signal, e.g. a logic signal which could not drive the large single buffer directly.

For a periodic clock generation application it is known that the overall delay of the buffer does not matter as long as the delays are matched between buffers and therefore the clock signals are fully synchronous.

This knowledge allows for a pipelined approach to buffering. Pipelining of logic is well known, where each logic stage is controlled by a clock signal to complete its logic evaluation before the next clock event, whereupon it passes the result to the next pipe stage. Logic pipelines can be long with high overall latency (many cycles) but with a throughput of one operation per clock cycle (once the pipe is full). Creating the simplest form of pipelined buffer is effectively the same as making a logic pipeline but with no actual logic involved at each stage, just passing on the same input state (or inverse of the input state) to the next stage synchronous to the clock edge.

**Logic could be added within the pipeline to allow for logical clock gating. If each stage of the buffer pipeline is made progressively larger (in terms of transistor width), the signal becomes stronger (as in its drive ability) as it moves down the pipeline and can be magnified to any required strength by adding new, increasingly larger piped stages.

The delay time of the pipelined approach is always likely to be greater than that of a conventional CMOS buffer chain because of the clock overhead, but the key point to note is that the delay time is controlled to be N clock cycles (N is the length of the pipeline) + 1 buffer delay time (the final buffer). The uncertainty is that of a single-stage buffer; the N cycle delay time is not relevant to a periodic signal such as a clock.
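The delay relationship above can be illustrated with a short Python sketch. The stage count, the 50 ps per-stage buffer delay and the 5-stage conventional chain are assumptions chosen only for illustration; the 3.125 GHz clock and the 10% delay variation are figures used elsewhere in this description.

```python
def pipelined_delay_ps(n_stages, clock_hz, final_buffer_delay_ps):
    """Total delay: N clock cycles plus the delay of the final buffer."""
    clock_period_ps = 1e12 / clock_hz
    return n_stages * clock_period_ps + final_buffer_delay_ps


def delay_uncertainty_ps(uncontrolled_delay_ps, variation_fraction):
    """Uncertainty contributed by the part of the delay not set by the clock."""
    return uncontrolled_delay_ps * variation_fraction


# Assumed values: 4 pipeline stages, 3.125 GHz clock, 50 ps final buffer.
total_ps = pipelined_delay_ps(4, 3.125e9, 50.0)
# Only the final buffer delay is subject to variation in the pipelined case.
pipelined_uncertainty_ps = delay_uncertainty_ps(50.0, 0.10)
# Assumed conventional chain: 5 stages of 50 ps, all subject to variation.
chain_uncertainty_ps = delay_uncertainty_ps(5 * 50.0, 0.10)
print(total_ps, pipelined_uncertainty_ps, chain_uncertainty_ps)
```

Under these assumptions the pipelined buffer has the larger total delay but the smaller uncertainty, which is the trade-off described in the text.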

**Clock gating applied in the pipeline for glitch-free operation.

Separated Path Approach to Buffering of Clock Signal.

The normal CMOS buffer of FIG. 1 has what can be called a ‘combined’path for the different polarities of signal to be amplified i.e. thecircuit path along which a logic “1” input signal travels to the outputis the same as the circuit path of a logic ‘0’ through the Pch/Nch pairinverter stages. This leads to excessive delay (mentioned previously)compared to a separated path design described below.

To speed up the delay times of a buffer, it can be split into two paths(two separate circuits combined only at the output and/or input), the “1drive” and the “0 drive” path.

Each path can be very fast, as each circuit has large transistors only to perform the ‘turn-on’ path for the particular output polarity (small transistors are still needed to reset the path ‘off-line’ during the non-active output period, but these do not impact the speed). The lack of large devices to be turned off is in contrast to the conventional CMOS inverter chain, where the non-active polarity transistors can slow down the progression of any change of state in the buffer.

The separated ‘1’ and ‘0’ paths are combined at the output side, and a side benefit of the separated path system is the absence of cross-conduction current spikes when designed correctly. It is straightforward to ensure that the final Nch and Pch devices are never simultaneously active by controlling the signal timings of the two paths.

EXAMPLE EMBODIMENT OF THE IDEAS

FIG. 2 is a block diagram of an illustrative example of a global clocking system incorporating the pipelined, split-path buffer to drive the final clock loads.

A high-frequency, 4-phase, 3.125 GHz Rotary Clock network covers the whole chip with a phase-locked clock. Local frequency division or more complex waveshaping logic (BWB, see the GB 0203605.1 application) produces the required clock signals for feeding to the buffers.

In this example, a 1 mm × 1 mm grid of BWB and buffers is used and each buffer is required to drive up to 50 pF in its 1 mm² area.

Moving Spot Generator.

A ‘moving-spot’ pattern generator [FIG. 2] driven from a tap into the high-speed 3.125 GHz rotary clock provides the timing sequence signals for frequency division and/or arbitrary waveform generation. Two stages are shown. For more than 2 stages, alternating stages are clocked with CLK90 and then CLK270 (or other clocks 180 degrees out of phase).

The circuit works by transferring a ‘1’ from OUTN to OUTN+1 during the ‘high’ time of the respective clock.

This circuit can replace those of [Application GB 0203605.1] and has output waveforms like those in FIG. 3 for a 6-stage design.

The sequence advances on each edge of the 3.125 GHz clock (a 6.25 GHz rate, i.e. 160 ps intervals). Feedback transistors nclr and pclr clear the previous stage back to the quiescent state as the new ‘spot’ position is reached. Bias transistors (not shown) are connected like the nclr and pclr transistors but have their gates connected to vdd and 0 V respectively, and are sized to provide a light bias current to absorb leakage currents.
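The behaviour described above can be modelled as a one-hot pattern that steps once per clock edge. The following is a behavioural sketch only, not a circuit description; the function name is hypothetical, and the wrap-around from the last stage back to the first is an assumption.

```python
CLOCK_HZ = 3.125e9
# The spot advances on both clock edges, so the step rate is twice the clock rate.
STEP_PS = 1e12 / (2 * CLOCK_HZ)  # 160 ps per step


def moving_spot(n_stages, n_steps):
    """Return the one-hot output states of an n-stage moving-spot generator.

    Each state is a list of n_stages bits with exactly one '1' (the spot).
    Clearing of the previous stage (nclr/pclr in the text) is modelled simply
    by writing the whole state afresh at every step.
    """
    states = []
    position = 0
    for _ in range(n_steps):
        states.append([1 if i == position else 0 for i in range(n_stages)])
        position = (position + 1) % n_stages  # assumed circular wrap-around
    return states


# A 6-stage design returns to its starting state after 6 steps (960 ps).
states = moving_spot(6, 7)
```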

Moving-spot generators are located (typically along with the rotary clock electronics) at the junctions of the Rotary Clock grid. Phasing of the global clock between any two corners is at most +/−30 ps at 3.125 GHz when the correct choice of one-of-4 local phases is tapped. It is possible to design the buffers with slightly different delay times to offset the known phase difference of the source clocks.

To synchronise multiple ‘moving spot’ generators, the final output of one generator is connected to the input of the next generator on the chip. These links are arranged so that a master generator (the only one arranged to produce a circular pattern, with its last output fed back to its first input) is able to force all other generators to move in step with it. It will take many ‘wrap-arounds’ for the synchronisation to ripple around the whole chip. FIG. 2 shows this.

To minimise the chip area consumed by the moving spot sequencers (which could be up to hundreds of bits long), the transistors would be sized close to minimum feature size. Such small circuits have weak output drive ability and need to be buffered before they can drive what might amount to a 50 pF local clock load.

Pipelined Buffer Circuits.

A split-path pipelined buffer is shown in FIG. 4.

The upper path is the “1” output path finishing with a Pch device.

The lower path is the “0” output path finishing with an Nch device.

Each path has some resemblance to the moving-spot generator circuitry in that a signal moves along with each ½ clock cycle, but in these buffer chains the transistor size increases progressively at each stage, perhaps by a factor of 5 each time. For the ‘1’ path, starting with a first-stage input Nch width of 8 micron, the final Pch output buffer after 4 stages is 2150 micron wide, enough to drive 50 pF in under 200 ps.

The input to the first stage of each path is routed to one (or more, using ‘OR’ gating) of the outputs of the moving-spot sequencer.

In the example simulation, the input to the ‘1’ path could come from the Q0 output of the moving spot generator, while the input to the ‘0’ buffer path could come from Q4 of the moving spot generator (which is two full cycles of the 3.125 GHz clock later).

The results of this arrangement are graphed in the Spice results of FIG. 5a and FIG. 5b.

Pipeline delays from IN and IN_N (i.e. from Q0 and Q4) are not important for the generation of a cycling clock signal.

High-frequency clock power consumption to drive this pipeline is low when a Rotary Clock tap is used, since the capacitive energy is recycled.

Shoot-through current elimination: Shown on the “1” path of the diagram in FIG. 4 are transistors which reset the gate of the final Pch (w=2143 u) transistor. This circuitry is driven by an ‘early’ output ‘out_lastbut1’ from the ‘0’ path chain. An active signal here gives an early indication that the ‘0’ output transistor is going to be switched on, permitting the large Pch to be switched off in time to avoid shoot-through conduction currents in the output stage. Circuitry to turn off the ‘0’ output transistor by an early indication from the ‘1’ pipeline is not shown but can easily be derived from the previous example.

With logic gating and programmable tap-points from the moving spot sequencer to the two buffer paths, an arbitrary waveform can be created with a resolution of 160 ps.

Choosing the other two phases of the 4-phase clock can offset the sequence by +/−50 ps.

Because the moving spot sequence is cyclic (wraps around), a continuous waveform will be generated at the OUT port at a lower frequency than the global clock rate.
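A simple behavioural model shows how tap-points into the cyclic moving-spot sequence yield a lower-frequency waveform. This is a sketch under stated assumptions: the function names are hypothetical, the pipeline latency of the buffer paths is ignored, and the spot is assumed to advance one stage every 160 ps.

```python
def split_path_waveform(n_stages, one_taps, zero_taps, n_steps, initial=0):
    """Output level at each 160 ps step for the given tap-points.

    one_taps:  spot positions wired (OR-gated) to the '1' drive path.
    zero_taps: spot positions wired (OR-gated) to the '0' drive path.
    Between taps the output simply holds its previous level.
    """
    level = initial
    waveform = []
    for step in range(n_steps):
        spot = step % n_stages  # cyclic moving-spot position
        if spot in one_taps:
            level = 1
        elif spot in zero_taps:
            level = 0
        waveform.append(level)
    return waveform


def output_period_ps(n_stages, step_ps=160.0):
    """Period of the generated waveform: one full wrap of the sequence."""
    return n_stages * step_ps


# 6-stage sequencer, '1' path fed from Q0 and '0' path fed from Q4.
wave = split_path_waveform(6, {0}, {4}, 12)
```

With these taps the output is high for four steps and low for two in every six, giving a 960 ps period (roughly 1.04 GHz) derived from the 3.125 GHz clock.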

[Note, the time scales of FIG. 4 and FIG. 5 are not aligned]

Since all the moving-spot generators on chip will be operating in sync, arbitrary local clocks can be created which have precise phase and frequency relationships to the other clocks on the chip. This helps with SOC integration of multiple IP blocks.

There are other options besides the use of the arbitrary waveform generators (moving spots + programmable decode) to provide the IN and IN_N signals for the split pipeline buffers. One idea is to use globally distributed IN and IN_N signals coming from external pins. The distributed IN and IN_N signals can themselves be pipelined (i.e. re-sampled and re-launched periodically on the higher-frequency rotary clock edges within the distribution) to maintain alignment. This arrangement allows external control of the internal clock buffers from, for example, an external test clock generator. There would be a latency of N cycles, but the random variation is still small: that of the last few buffer stages.

OTHER REFERENCES

-   [Lui] Xun Liu, Marios C. Papaefthymiou, Eby G. Friedman. Retiming and Clock Scheduling for Digital Circuit Optimization. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol. 21, No. 2, February 2002.
-   [TIM] M. C. Papaefthymiou and K. H. Randall. “TIM: A timing package for two-phase, level clocked circuitry.” Proc. 30th ACM/IEEE Design Automation Conf., June 1993.
-   [Timberwolf] C. Sechen and K.-W. Lee. An improved simulated annealing algorithm for row-based placement. In Digest of Papers, International Conference on Computer-Aided Design, pages 478-481, Santa Clara, Calif., November 1987.

Figures and diagrams referred to in the specification hereinafter are those shown on sheets 32/53 to 53/53 of the drawings of the present application.

Designing synchronous (i.e. clocked) VLSI devices requires a combination of circuit and software techniques and/or algorithms.

This invention relates to a series of devices which may act alone or together to aid in the achievement of low-power, high-frequency global VLSI clocking (meaning across the whole chip as well as local clocking), together with support circuitry and software to complete an industrial design capable of supporting run, test and diagnostic modes. Specifically:

-   Global high frequency synchronisation through Rotary Clock network.
-   Globally distributed synchronisation of low-speed (multi-cycle) events.
    -   Moving-spot synchronisers sub-sampling lower rate events and acting over the whole chip instantaneously [drawings sent to Keith]
-   Global low-latency high speed data interconnect mechanism (synchronous OR asynchronous [latter is the circuit shown to Reshape]): GB 0218834.0
-   Programmable frequency division and/or programmable phase offset to support legacy sub-GHz clocks.
-   Low skew/jitter buffering mechanisms for clock signals: 0225814.3 (Jun. 12, 2002)
-   Adiabatic frequency division components: GB0203605.1 (15/2/02)
    -   AND idea shown under NDA to Conrad Umich.
-   Adiabatic, energy conserving Logic family: GB0214850.0. (2716/02)
-   Energy conserving high performance latch techniques as discussed hereinafter
    -   incorporating ‘gating’ [Re previous patent]

General Trends in VLSI Design

Here we discuss trends seen in the last 5 years which impact how VLSI chips are designed and implemented.

Interconnect

The biggest change has been from the previous ‘transistor-dominated’ design methodologies to modern ‘interconnect-dominated’ design. Historically, when transistor (and therefore logic gate) delays dominated the design of synchronous systems, little regard was paid to interconnect delays.

Today interconnect delays dominate circuit performance. Clocking is one instance of a long-reach signal; the same issues apply to all interconnects exceeding perhaps 0.1 mm in length, where the interconnect delay time can exceed that of a logic gate.

Interconnect must be treated as a first-class physical effect and not simply as a ‘parasitic’ with associated margins to account for the effect.

Timing Problems.

Since interconnect delays are becoming dominant, and it is often hard to predict the delays until a circuit layout is complete, ‘Timing analysis’ and ‘Timing convergence’ have become essential. Delays must be based on actual placements of wires, buffers and clocks to make sure the synchronous system will work (all Setup and Hold times on all paths must be met).

Changes to layout may be required to meet timing constraints, and this can frequently result in ‘Timing Convergence’ problems, where a new layout is tried but leads to new timing violations elsewhere in the design, causing iterations and delay to market.

Concept of a Clock

In a synchronous system, data is controlled by the operation of a clock signal. The clock controls the time at which data is allowed to change (output clocks) and also the time at which data is captured (input clocks).

The clock is a global signal routed to all latches on the chip. It therefore has the most ‘parasitic’ interconnect effects of any interconnect and so is subject to the most scrutiny. In fact, it must be remembered that it is the relative timing between clock and data which is important (something that is often overlooked).

Concept of Register (Latch or DFF)

A register here refers to either a pass-latch (also known as a level-triggered flip flop) or an edge-triggered flip flop (e.g. DFF). Either of these devices is able to control the progression of a data signal from input to output by use of a ‘clock’ input signal. The terms Register, Latch and DFF are used interchangeably in many papers and the exact meaning must be inferred from the context.

Concept of Cell

‘Cell’ is the generic term for a pre-designed layout pattern which, when instantiated somewhere on a chip, yields a functional component (e.g. NAND gate, multiplexer, latch) after manufacture. Cells are hierarchical: bigger cells can contain smaller cells wired together. The lowest-level cells contain transistor layouts. Most higher-level cells just contain sub-cells and wiring.

Concept of Paths

For synchronous systems, the concept of a ‘Path’ extends the idea of a netlist to encompass groups of signals originating from registered outputs, which combine logically (through logic gates) to ultimately arrive as a single-bit input to a single register, with some complex time delay characteristics.

The path concept fits well with the realisation that most logic operations are reductions, usually multiple inputs -> one output.

Constraints on timing relate to paths because:

-   1. Relative timings between clocks and data changes are important.
-   2. Any one of the inputs on the path can possibly change the output which feeds the latch.

[path_and_parasitics.ps ????]

A single Net can be involved in multiple paths: several registers may have their inputs determined in some way by data on one Net.

[Note that the simple Nets assumed during design may be replaced by complex interconnect parasitic networks which exhibit delay]

To find all the components of a path involves a search of the connectivity database (the netlist), starting at the D input of a register (e.g. a DFF) and working ‘backwards’. This search will typically be done using a graph-database package. The search result ‘fans out’ as the algorithm progresses, collecting Nets and Cells involved in the path, until ultimately every branch has ended in the output of another register.
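The backward search can be sketched as a breadth-first traversal. The netlist representation used here (a dictionary mapping each cell to the cells driving its inputs, plus a set of register names) and the function name are assumptions chosen for illustration, not the format of any particular tool.

```python
from collections import deque


def trace_path(drivers, registers, end_register):
    """Collect the logic cells and launching registers of one path.

    drivers:      dict mapping a cell name to the list of cells driving it.
    registers:    set of cell names that are registers (latches or DFFs).
    end_register: the register whose D input terminates the path.
    """
    logic_cells = set()
    start_registers = set()
    queue = deque(drivers.get(end_register, []))
    while queue:
        cell = queue.popleft()
        if cell in registers:
            # A branch of the search ends at the output of another register.
            start_registers.add(cell)
        elif cell not in logic_cells:
            logic_cells.add(cell)
            queue.extend(drivers.get(cell, []))  # fan out backwards
    return logic_cells, start_registers


# Hypothetical netlist: R1 and R2 feed gates G1 and G2, which feed R3.
example_drivers = {"R3": ["G2"], "G2": ["G1", "R2"], "G1": ["R1", "R2"]}
example_registers = {"R1", "R2", "R3"}
cells, sources = trace_path(example_drivers, example_registers, "R3")
```

Cells with no recorded drivers (for example primary inputs) simply end a branch in this sketch.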

Path analysis is primarily used for timing analysis and is not usually concerned with the logical functionality (except where false-path analysis is performed).

Registered elements produce and receive signals at fairly well-defined times (given by the clock), unlike logic-gate paths and interconnect, whose speed can vary greatly. The primary purpose of clocks + registers is to remove timing uncertainty by adding delay or storage.

A Path, for the purposes of this paper, is therefore the collection of time-delaying items (interconnect and gates) between the (clock-stabilised) registered outputs and a registered input.

Static timing analysis is used to check that none of the paths in a circuit fail because of a setup or hold time violation.

Setup and Hold Constraints

The typical DFF register (from the user's point of view) responds to a rising edge of a clock waveform, capturing the data signal value which existed before the edge of the clock. In practice the DFF is not an instantaneous device.

Well-known constraints on synchronous systems are Setup and Hold. The diagram shows two possible problems when sampling data. In both cases, a ‘0’ is intended to be captured since the data is zero before the rising clock edge occurs.

-   Hold time violation: Data must be held stable for a small time (Hold time) after the rising edge or else a Hold-time violation occurs. In the diagram, the first clock pulse is supposed to clock in a ‘0’, but the data changes from ‘0’ to ‘1’ too soon after the rising edge, which might cause the ‘1’ to be sampled instead of the ‘0’. To prevent hold time problems, the data must not change until at least the DFF's specified hold time after the edge.
    -   Fixes: There are three possible fixes to hold-time problems.
        1. Make the logic circuits in the data path slower, so data cannot change too soon.
        2. Adjust the clock phase to the register so that it occurs earlier.
        3. Adjust the clock phase of all the registers which feed this path to a later phase (achieves the same as (1) above, but constraints apply).
-   Setup time violation: Data must be stable for a sufficient time (Setup time) before the clock edge occurs. In the diagram, the second clock pulse is also expected to sample ‘0’, but there has not been enough setup time prior to the rising edge and so a ‘1’ (the previous state of the input) might be sampled. [This occurs because a DFF is NOT really an edge-triggered device: it continuously samples the input state while the clock line is low. This sampler cannot respond instantly to changes in Data.]
    -   Fixes: To fix setup time violations there are three choices.
        1. Make the logic circuits faster so the data changes in time for the clock.
        2. Adjust the clock phase of the register to occur later.
        3. Adjust the clock phase of all the registers which feed this path to an earlier phase (achieves similar to (1) above, but subject to constraints).
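The two constraints can be expressed as a window around the capturing clock edge in which the data must not change. The following is a minimal sketch; the function name, the treatment of the window boundaries and the example timings are assumptions for illustration only.

```python
def check_setup_hold(data_change_ps, clock_edge_ps, setup_ps, hold_ps):
    """Classify a data transition relative to a capturing clock edge.

    The data must be stable from (edge - setup) until (edge + hold).
    A change inside the setup part of the window is a setup violation,
    a change inside the hold part is a hold violation.
    """
    if clock_edge_ps - setup_ps < data_change_ps <= clock_edge_ps:
        return "setup violation"
    if clock_edge_ps < data_change_ps < clock_edge_ps + hold_ps:
        return "hold violation"
    return "ok"


# Hypothetical register with 50 ps setup and 30 ps hold, clock edge at 1000 ps.
early_enough = check_setup_hold(900, 1000, 50, 30)   # changes well before the edge
too_late = check_setup_hold(980, 1000, 50, 30)       # changes inside the setup window
too_soon = check_setup_hold(1010, 1000, 50, 30)      # changes inside the hold window
```

Moving the clock edge later turns the second case into a pass, and moving it earlier does the same for the third, which mirrors fix 2 in each list above.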

From the above, the symmetry of the Setup and Hold problems can be seen with respect to the cause and possible solutions. Known methods of moving clock phases are called variously ‘Scheduled Skew’, ‘Slack-Borrowing’ and ‘Time stealing’, and are accepted industry practice.

Another method of sequential circuit optimisation is called ‘Retiming’ [Ref SIS paper], where the positions of registers are moved along the paths in an attempt to equalise the delay times. A register feeding the input of a logic gate can be moved to the output of the logic gate (or vice versa) according to well-known rules which maintain logical equivalence and timing.

Hierarchical Clocking System (the Priority Document Hierclock)

Earlier rotary-clock-centred circuits focused on improving clock generation and distribution [previous figures in hierclock application] by forming grids of rotary clock structures. 4-phase distribution was outlined as an option. Localised clock division and arbitrary waveform generation for multiple frequency/phase related clock generators over the surface of a chip was discussed and called BWB (Binary waveshaping blocks). The key idea was the global synchronisation of events using locally communicating state machines arranged in a chain to avoid long-distance communication overheads.

As these ideas have been refined, a proposed test chip architecture is possible, as shown in [testchip4.ps ???]

Other recent developments and improvements to the hierarchical clocking scheme are set out in the rest of this document with appropriate background information.

Slack Budgets & Multi-Phase Clocking: the Concept of ‘Slack’, ‘Critical Path’

Slack is a measure of the amount of ‘spare’ or ‘slack’ time available on a synchronous path before a Setup time violation might occur. If all paths of a synchronous machine exhibit slack, then the clock cycle can be reduced until one path becomes ‘critical’, i.e. it reaches the setup-time limit. This is then the Critical Path of the system and sets the minimum cycle time (in single-phase systems).

Multi-phase synchronous systems (as well as so-called asynchronous systems), i.e. those which can have more than a single timing reference, are able to break this time limit by rescheduling the pipelines to pass slack from fast paths onto slow paths which suffer tight or negative slack. The limit in these cases is that, for a pipeline of N stages, the sum of the delays of the N paths along the pipeline must not exceed N*tcycle. For example, a 3-stage pipeline operating at 1 GHz could have paths of 0.5 ns, 2 ns and 0.5 ns and it would still work at 1 GHz.
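The difference between the single-phase limit and the multi-phase limit can be checked numerically. This is an idealised sketch: the function names are hypothetical, setup and hold overheads and any limits on how far a clock phase can be shifted are ignored, and a non-strict comparison is assumed so that the boundary example in the text is accepted.

```python
def single_phase_feasible(path_delays_ns, t_cycle_ns):
    """Single-phase clocking: every path must fit within one cycle."""
    return max(path_delays_ns) <= t_cycle_ns


def multi_phase_feasible(path_delays_ns, t_cycle_ns):
    """Slack passing: only the total delay over N stages is bounded by N*tcycle."""
    return sum(path_delays_ns) <= len(path_delays_ns) * t_cycle_ns


# The 3-stage example from the text at 1 GHz (1 ns cycle).
delays_ns = [0.5, 2.0, 0.5]
ok_single = single_phase_feasible(delays_ns, 1.0)
ok_multi = multi_phase_feasible(delays_ns, 1.0)
```

The 2 ns stage rules out single-phase operation at 1 GHz, while the 3 ns total just meets the 3-cycle budget once slack from the two fast stages is passed to the slow one.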

Slack is measured in units of time, typically picoseconds, and must be zero or higher under all conditions for a synchronous circuit to work. Negative slack numbers sometimes appear in timing analysis, meaning that the clock period must be increased for the circuit to work.
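A minimal slack calculation for a single-phase system is sketched below. The function names, the path names and all delay values are hypothetical, and clock skew is assumed to be zero.

```python
def setup_slack_ps(clock_period_ps, path_delay_ps, setup_time_ps):
    """Setup slack of one path: time left over after delay and setup time."""
    return clock_period_ps - (path_delay_ps + setup_time_ps)


def critical_path(path_delays_ps, clock_period_ps, setup_time_ps):
    """Return (name, slack) for the path with the least slack."""
    name = min(
        path_delays_ps,
        key=lambda n: setup_slack_ps(clock_period_ps, path_delays_ps[n], setup_time_ps),
    )
    return name, setup_slack_ps(clock_period_ps, path_delays_ps[name], setup_time_ps)


# Hypothetical paths (delays in ps) checked against a 1000 ps clock and 50 ps setup.
paths = {"a": 700, "b": 980, "c": 400}
worst = critical_path(paths, 1000, 50)
```

Here path "b" has a slack of -30 ps, so the clock period would need to grow by at least 30 ps for the single-phase circuit to work.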

Slack, which refers only to setup-time constraints, is the term most widely used in the literature to describe timing issues. Hold time violations for typical DFF edge-triggered, single-phase systems are easily fixed and often do not receive much attention. For general analysis, it is not possible to study a synchronous system purely in terms of slack, especially where multiphase clocking or transparent (level-triggered) flip flops are used.

The complete conditions for synchronous operation given Setup and Hold constraints are given in [Lui].

Traditional Synchronous System Design Flow

Design of a synchronous machine involves CAD tool steps to produce the photolithographic outputs.

-   1. High-level description (HDL), e.g. VHDL or Verilog source code, created by a human designer.
-   2. Logic synthesis: mapping the intended logic and state transitions to a combination of pre-designed Latches, Gates and Buffers (collectively known as cells) and Netlists (interconnects) to implement the function. Clocks control the latches and the state change from one to the next, and are often assumed to be single-phase control lines routed all over the chip.
    -   The timing of the circuit is only an estimate at this point because, until the chip is placed and routed, the final parasitic capacitances are unknown and can change the critical path length.
-   3. Place & Route
    -   Place: cells are positioned on the chip layout using a CAD tool which often attempts many possible layout configurations to optimise various functions such as ‘minimum wirelength’ or ‘optimum timing’.
    -   Route: Auto-routing software takes the placement information of the cells determined above, plus the Pins (interconnect locations on each cell), plus the netlist (which pins connect to which other pins) to determine the interconnect paths.
    -   Placement is normally not affected by the idea of clock signals because it is assumed the clock line will be available everywhere, like the power lines.
    -   Routing of the clock lines is performed by a special tool called ‘CTS’ (Clock-Tree-Synthesis), a special auto-router (e.g. H-tree) which, in the more advanced versions, can also insert active buffer elements.
-   4. Timing analysis and Convergence.

Today in industry there are many possible approaches to the above tasks. Most of the algorithms mentioned above use heuristics and iterative approaches to optimisation. For example, a well-known auto-placement code called TimberWolf uses a ‘Simulated annealing’ method. Cells are moved at random and each new placement is evaluated to see if it improves the goal (lowers the cost function), which may combine any number of factors evaluated at each iteration. Common cost functions are total wiring length and delay time. Clock-related placement of latches is not undertaken, since a ‘single-phase-everywhere’ methodology means that the clock is seen as a global resource much like power and ground.
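The general idea of annealing-based placement can be illustrated with a very small example. This is a generic simulated-annealing sketch with a wirelength-only cost function; it is not the TimberWolf algorithm, and the function names, the linear cooling schedule and all parameters are assumptions made for illustration.

```python
import math
import random


def wirelength(placement, nets):
    """Total Manhattan length of two-pin nets for a {cell: (x, y)} placement."""
    return sum(
        abs(placement[a][0] - placement[b][0]) + abs(placement[a][1] - placement[b][1])
        for a, b in nets
    )


def anneal_placement(cells, nets, slots, steps=2000, t_start=5.0, seed=0):
    """Place cells (a list) onto slots by random pairwise swaps.

    A swap that lowers the cost is always kept; a swap that raises it is
    kept with a probability that falls as the temperature is reduced.
    """
    rng = random.Random(seed)
    order = list(slots)
    rng.shuffle(order)
    placement = dict(zip(cells, order))
    cost = wirelength(placement, nets)
    best, best_cost = dict(placement), cost
    for step in range(steps):
        temperature = t_start * (1.0 - step / steps) + 1e-9
        a, b = rng.sample(cells, 2)
        placement[a], placement[b] = placement[b], placement[a]
        new_cost = wirelength(placement, nets)
        delta = new_cost - cost
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            cost = new_cost
            if cost < best_cost:
                best, best_cost = dict(placement), cost
        else:
            # Reject the move: swap the two cells back.
            placement[a], placement[b] = placement[b], placement[a]
    return best, best_cost


# Hypothetical 4-cell chain a-b-c-d placed on a single row of 4 slots.
chain_cells = ["a", "b", "c", "d"]
chain_nets = [("a", "b"), ("b", "c"), ("c", "d")]
row_slots = [(0, 0), (1, 0), (2, 0), (3, 0)]
best_placement, best_cost = anneal_placement(chain_cells, chain_nets, row_slots)
```

Other cost terms, such as delay time, could be added to the cost function in the same way as wirelength.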

Multigig Rotary-Clock Design Flow

-   1. HDL
    -   Identical to above.
-   2. Logic Synthesis.
    -   Identical to above. A standard tool runs from the HDL code to produce a list of logic gates, an initial list of registers and a netlist giving the interconnect between items.
-   3. Sequential Optimisation and phase-spreading methodology.
    -   This is a new step but based on known ideas.
    -   The following operations are performed on the netlist in accordance with the specified reference papers, either sequentially or simultaneously [Liu]:
        -   a) Retiming
        -   b) Clock skew scheduling
        -   c) Optionally, conversion from edge-triggered to level-triggered flip-flops [TIM paper]
    -   The result of a, b, c above is a new netlist where the logic gates remain the same as in a standard flow but the register configuration is changed (we do not discount the possibility of doing logical optimisation, such as with the Espresso [berkeley] tool, at this point). The number and placement (in the netlist) of the registers may be different from the standard flow. Additionally, a clock skew schedule (an annotation of the optimum phase of each register) is produced. A methodology for mapping this schedule (via placement) onto the Rotary Clock's natural ability to generate multiphase clocks is one aspect of the invention outlined here.
-   4. Place and Route.
    -   We call this type of algorithm, where logic path cells are placed relative to latches which in turn are placed at known phase-points of the clock, ‘Placement Driven Timing’, to contrast with the usual ‘timing driven placement’, which attempts to place based only on data timings, usually assuming a single-phase clock or at least a clock with a small amount of skew.

The prototype of the improved flow uses new cost functions built into Timberwolf to promote the placement of gates close to the appropriate latch. On each placement iteration of the simulated annealing method, the tolerance of phase is determined for each unconnected output of cells which are to feed the D input of a latch. If the placement is close enough to a latch which, by connection to the local rotary clock phase, has a suitable phasing, the placement is retained. The final drawing of designflow.sdd shows that any one of 4 possible phasings is available for any latch just by permutations of the via pattern into the Clock lines. Therefore 4 possible phases can be evaluated for every possible latch, greatly increasing the chances that a suitable timing can be found and that a complete spread of loadings onto the Rotary clock will be achieved. Use of transparent pass-latches will extend the margin even further.
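The evaluation of the four available phasings at a candidate latch site might be sketched as follows. This is not the cost function of the prototype: the function names are hypothetical, and it is assumed that the four phasings are the local clock phase plus 0, 90, 180 and 270 degrees and that the skew schedule supplies a required phase and a tolerance in degrees.

```python
def phase_error_deg(a, b):
    """Smallest absolute difference between two phases, in degrees."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def best_tap(local_phase_deg, required_phase_deg, tolerance_deg):
    """Pick the best of the 4 phasings available at a latch site.

    Returns (phase, error) if the closest phasing is within tolerance,
    otherwise None, meaning this site does not suit the latch.
    """
    candidates = [(local_phase_deg + k * 90.0) % 360.0 for k in range(4)]
    best = min(candidates, key=lambda p: phase_error_deg(p, required_phase_deg))
    error = phase_error_deg(best, required_phase_deg)
    return (best, error) if error <= tolerance_deg else None


# Hypothetical site with a local phase of 10 degrees and a 15 degree tolerance.
accepted = best_tap(10.0, 100.0, 15.0)   # the +90 degree phasing fits exactly
rejected = best_tap(10.0, 50.0, 15.0)    # no phasing is close enough here
```

In an annealing loop, a rejected site would simply count against the candidate placement in the cost function.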

Results of the placement feed the Routing phase of layout, which can be achieved with standard tools.

The flow is outlined as a flow chart in the diagram [timberwolfflow.sda ???] and in more detail in [designflow.sdd ??].

Testing of Rotary Clocked Circuits.

Coupled LC-based oscillators like Rotary Clocking [ref original patent] are inherently difficult to stop for gating or testing purposes because energy is contained in the circuits and cannot be immediately released in a fully controlled way.

The rest of this section describes in-principle additions to latches and ancillary circuitry to allow single-stepping, BIST and scan-testing to be performed on Rotary Clocked chips through the indirect means of modifying the storage elements (latches or DFFs) which are driven by the clock.

The basic principle is to synchronously data-gate latches connected to the clock lines to mimic traditional clock gating where, say, an AND gate is inserted in the clock path. There is a direct equivalence between clock gating and data gating, with no perceptible difference externally and no difference in area to implement.

Synchronous Data Gating (as implemented within the proposed latches further below)

Previously suggested circuits

-   Patent [PCT, current one ????] has descriptions of data gating for Rotary Clock as an alternative to clock gating.
    -   This is EXACTLY equivalent in terms of effectiveness BUT can save area, because stopping activity upstream will, within a few cycles, stop downstream activity. [new concept of looking through the BDD? graph and finding where the best places for data gating are to stop forward switching activity; there might only be a few such places]
-   Patent [PCT, earlier one perhaps] has
    -   power-down of rotary clock: this can be done OK once an orderly ‘stop’ has been performed using the latches.
    -   descriptions of real clock gating with pass transistors

Newer Circuits:

Proposed here are methods to extend the above concepts and synchronously gate latch elements driven by a rotary clock to prevent spurious sampling.

These circuits require circuitry [Keiths new circuits] for multi-cycle global synchronisation using locally cooperating state machines operating off a phase-locked global clock.

Latch Technology to Suit Rotary Clock Flow

All synchronous systems rely on some kind of latching element to control data flow. These are referred to variously as Latch, D-flip flop (DFF) or Register. These circuits use clocks to make path delays less uncertain by allowing changes only at specified times relative to the clock timing source.

Since the late 1980s, a single-phase edge-triggered D flip-flop methodology has been preferred industry practice. The biggest barrier to the previously common multiphase clock distribution methods has been the difficulty in creating and distributing more than one clock phase while maintaining phase accuracy relative to one another.

For Rotary Clocking, many different DFF and pass-latch designs were evaluated. However, most latches and FFs use internal buffers and inverters because of their single-phase lineage. When driving from a true differential clock source such as a Rotary clock, these are not required.

Another useful attribute for any latch device used with an L-C based clocking scheme is a constant capacitive loading presented to the Rotor wiring (clock loading which doesn't depend on the data being passed through the latch). Without this there can be pathological worst cases where all latch data switches from 0 to 1, changing the capacitance, therefore the period, and therefore the phase stability.

There is a lot of inherent tolerance to capacitance variations afforded by the multiple rings of a rotary clock.

True DFF Latch

Fig? shows a true edge-triggered DFF latch suitable for use with a Rotary clock. It has many of the preferred features regarding clock inputs listed previously for Rotary Clocked operation.

Note:

-   that the feedback from the buffered output and the STOP components gives an edge-triggered characteristic where the output state cannot change after the active rising edge no matter what happens on the D input
-   PS and NS are turned off at the inactive part of the clock cycle to re-arm the latch

[dff_fast.ps]
(picture of waveforms from above)

Pseudo DFF Latch Proposal

[constant_clock_C2.ps, with the SRAM I/F]
(picture of waveforms from above)

A design of a simpler and faster latch element is shown in Fig?.

This circuit is essentially a pass-latch but is intended to be characterised and operated like a DFF.

Since it is transparent while the clock is high, it exhibits a long hold-time characteristic compared to the DFF for which it is a stand-in. However, it transpires that at very high frequencies this hold time is less than ½ of a clock cycle, due to delay times in the output stage of the latch, and there is very little difference between it and a master-slave latch when operated at one specific frequency or over a small range of operating frequencies, perhaps a 2:1 range.

Safe usage of this latch for multiphase clocking requires that the sequential optimisation stage meets the setup/hold times of all latches.

The latch is designed as a split-path circuit where the Zero and the One circuits are separated to improve speed and to eliminate cross-conduction.

Note:

-   Clocked transistors N1, P1 are not in line with the data but connect to the supplies. Gate capacitance is largely unvarying with data input value, since the channel of the clocked transistors fully charges and discharges from a solid path to either VDD or Gnd at each half of the clock phase for both clocks (true and complement) through the transistor source connections.

Hold i.e. Stop arrangements:

Transistors N5, P5 control the “effective clock-gating”. While for SOI processes true clock gating is feasible with Rotary Clock, bulk CMOS has too much RC to perform clock gating efficiently. It was shown in the [PCT????] application that there is seldom any need to gate the Rotary Clock (why disable the clock when it isn't using much power?), but for SCAN testing (see section further below) it is essential to hold the state. N5, P5 perform ‘data gating’, which is ‘effectively clock gating’, to hold the state of the latch when *STOP is high and STOP is low. Also, choking the data makes the logic downstream of the latch inactive, reducing data-activity related power consumption, again directly comparable with clock gating.

(Ideally the stop signals have a low-impedence turn on/off drivecharacteristic but a high impedance quiescent drive to to isolate thegate capacitance from the D input path as far as it would slow down theoperation of the latch.)

Generation of the STOP signal event must be carefully controlled intime. The global synchronisation method outlined in GB0203605.1 andimproved versions of this circuit outlined here can achieve thisglobally simultaneous “STOP” signal which immediately freezes the stateof the whole synchronous machine—at which point the state can be dumped.

Effective “Functional clock gating” can be implemented where the STOPsignals are generated from logic signals—possibly qualified by the localrotary clock to ensure Start/Stop occurs only during latch inactivetime.

Clock activity will usually continue during the Stop period so thatrestart can be synchronous and glitch fee.

Using Pseudo-DFFs with Different Clock Phases

The latch discussed above could, if needed, be used in pairs to act on one signal, each latch of the pair having different *CLK and CLK orientations to implement a non-shoot-through DFF-type arrangement which would work down to very low speed.

A further option is that the pair could use 90 degree (4 phase) relative alignment and, given the delay time, would not suffer shoot-through over a broad set of high clock frequencies.

-   -   This represents a very aggressive methodology, but supply voltage binning ought to push all the hold failures away: if a chip is failing on hold times, reduce the supply voltage. This will move the potential failure over to setup time, but with transparent latches there will be some budget here also.

Global Synchronisation Methods, e.g. Generating the STOP Signal for Latches Over the Whole Chip at the Same Time

It is well known that it is difficult to transmit a global signal across a chip within a very short clock cycle. Measures such as true transmission-line techniques (lightspeed application) can extend the distance a signal can move in a given time period, but often the overhead of such an approach is not needed when update rates are slow.

The goal of the circuits given here is to provide a generic low-overhead method of synchronising low-speed external events with high-speed internal Rotary clocking. The signals are ‘undersampled’ in that many Rotary clock periods are allowed for a low-speed signal to become stable (giving it time to propagate fully across the chip from the external pins), but after this /N count latency of the high-speed clocks, the event can be simultaneous over the entire chip.
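The undersampling idea can be illustrated with a short behavioural sketch. It is not taken from the circuits of the drawings: the function name `local_stop_cycles`, the per-region arrival cycles and the value N=64 are assumptions made purely for illustration, as is the premise that every region's /N counter rolls over on the same high-speed cycle. Each region acts on the slow signal only at its next rollover, so regions that see the signal at different times still respond on the same cycle.

```python
def local_stop_cycles(arrival_cycles, n):
    """Return, per region, the high-speed cycle at which the event takes effect.

    arrival_cycles: cycle at which the slow external signal is stable at each
                    region (differs from region to region due to wire delay).
    n:              length of the shared /N count, in high-speed clock cycles.
    """
    effective = []
    for arrival in arrival_cycles:
        # Assumption: all /N counters roll over together at cycles 0, n, 2n, ...
        # A region only acts on the signal at the first rollover at or after
        # the cycle on which the signal has arrived locally.
        next_rollover = -(-arrival // n) * n  # round up to a multiple of n
        effective.append(next_rollover)
    return effective


# Hypothetical example: the signal reaches four regions at different times,
# all within one /N window of 64 cycles.
print(local_stop_cycles([3, 17, 42, 60], n=64))
```

In this sketch the event is simultaneous only if all arrivals fall within one /N window, which corresponds to allowing enough Rotary clock periods for the low-speed signal to become stable everywhere on the chip.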

One such use of a signal would be the STOP signal for latch control (see Fig? Latch design). For example, an external STOP signal is driven onto the chip and the resynchronisation method (operating off the locally inactive phase of the clock) will generate the required STOP signal without corruption.

With the ability to effectively stop the whole chip simultaneously over the entire chip area, the usual problems of slow interconnect are overcome at the expense of latency.

The necessary mechanism for global multi-cycle synchronisation through multiple short-distance local synchronisation links was described in the [original hierarchical clock filing] in the section on Multiple Global, frequency-divided clocks.

-   Additional diagrams [keith drawings] are offered here as illustrative further examples of the details of how this could be implemented.
-   (Keith's version of the divider: the circuit he sent to me.)

Modified Gates Incorporating Latching Function

[nandlatch.ps ???] The only changes relative to a standard NAND gate are the clock-gated power transistors. When the clock is inactive, the gate is not powered and is unable to drive the interconnect. In the active portion of the clock, the output capacitance is charged with the normal NAND function !(A&B). Gating in this way can control the output transition time for early input signals.
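A purely behavioural sketch of such a gate is given below. It assumes, as an idealisation not stated in the text, that the undriven output node simply keeps its previous charge while the clock is inactive (leakage is ignored); the class name `ClockGatedNand` and its interface are hypothetical.

```python
class ClockGatedNand:
    """Behavioural model of a NAND gate with clock-gated power transistors."""

    def __init__(self, initial_output=1):
        # Charge currently stored on the output capacitance.
        self.out = initial_output

    def step(self, clk_active, a, b):
        """Advance one clock half-period and return the output node value."""
        if clk_active:
            # Gate is powered: output is driven to the NAND function !(A&B).
            self.out = 0 if (a and b) else 1
        # Gate unpowered: output is not driven, so the old value is retained
        # (assumption: ideal charge storage on the output node).
        return self.out


gate = ClockGatedNand()
print(gate.step(False, 1, 1))  # inputs arrive early; output not yet updated
print(gate.step(True, 1, 1))   # active portion of clock; NAND evaluated
print(gate.step(False, 0, 0))  # inputs change; output holds until next active phase
```

The first call shows the point made in the text: an early input does not move the output until the active portion of the clock.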

Gated Interconnect (i.e. Synchronous Repeaters)

[gated interconnect.ps ???].

Gating of data can be performed outside of logic gates and latches. The drawing [fig?] shows gates placed in line with the interconnect. There will be some data-dependent clock capacitance, and this can be tolerated to a limited extent. When buffered, such a gate becomes a synchronous repeater. These items and the modified gates of [fig???] would typically not be inserted to hold state (so do not need to be ‘Stoppable’) and function to equalise the delays around multiple branches of a path [depends on sequential optimisation strategy].

Testing of Digital Circuits (Background Information)

Synchronous VLSI chips require the clocking system to provide not only system timing to control latches and other storage elements but also a mechanism to aid in testing of the finished silicon, which can exhibit several forms of failure, usually from physical defects caused by, e.g., contamination or optical problems during manufacture or lithography respectively. Some of the most common faults are:

-   -   1. Stuck-At fault
        -   this is where a defect causes a circuit node to be stuck at logic ‘0’ or logic ‘1’.
    -   2. Delay fault
        -   a fault which does not affect the logic operation but causes a path to take a (usually) longer time to evaluate than normal. These faults prevent the device working at the intended clock speed and can render the device unsalable.
    -   3. Leakage current fault
        -   where a dynamic node can fail to maintain its charge for the minimal amount of time. This fault will show up by a device not working at all, or else failing at elevated temperature or at lower than nominal operating speed.

The above are usually random failures in manufacturing and reduce yield somewhat, but even a device designed correctly is subject to other, systematic faults which may affect every chip fabricated: sometimes optical interactions or combinations of manufacturing tolerances can create unintended features at the same point on every chip, or at the same regions of the wafer.

Systematic faults are the most troublesome; they must be debugged and can require a re-spin of the masks or rework of the process. In either case, unless diagnosis of the problem is possible through testing, correction is impossible and the yield could be zero.

External Test/Debug

Debugging from outside a chip is of limited use these days: only a tiny fraction of the signals which a VLSI device uses are available on the external pins for measurement. The same problem applies to stimulus: there are not enough pins. Finally, the speed at which modern chips can run is often 10× or more faster than a production-line tester can operate at.

Testing Aids (Internal).

The current solution is to devote on-chip hardware specifically to enable testing of the device itself using test patterns. These digital test patterns can exercise the internal logic of a device with known stimulus and, since the logic is supposed to be deterministic, the output should be predictable if the device is functional; this output can be tested for compliance to check whether the chip is working.

For conventional JTAG (a published standard) scan testing, the test patterns are generated using ATPG (Automatic Test Pattern Generation) software during the design of the logic elements through logic synthesis [ref: SIS public domain system from Berkeley]. The test patterns are designed to fully exercise the logic to reveal any possible stuck-at fault. Shift registers (or possibly the DFFs reconfigured to act as a chain) are used to shift in the test pattern as a machine state (a synchronous system is defined at any time entirely by the states inside its storage elements), and a single clock pulse can then be issued to move the machine state onto the next state. Then the new state captured from the logic is read out and compared to the expected result.

This is a time-consuming process and tester time is expensive. Another drawback is that the scan-based approach traditionally can only identify stuck-at faults, but not delay faults or leakage faults, since the clock rate generated by a tester is generally not fast enough. A second approach is called Built-In Self-Test (BIST), where on-chip pseudo-random pattern generators are employed. Each of these generates a deterministic but highly changeable pattern (sequenced by the clock) and the pattern feeds the logic. Outputs from the logic are captured and condensed using a type of running checksum algorithm, again synchronous with the clock. After a long series of many clock cycles the checksum should be of a known value if the logic is functioning correctly. This can be tested against a known-good sample checksum or a checksum computed by software which is aware of the generators' pattern and the checksum generator operation.
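The BIST principle described above can be sketched in software as follows. This is a minimal illustration only: the 8-bit width, the feedback taps, the seed and the function `logic_under_test` are all assumptions chosen for the example and do not describe any particular on-chip generator or checksum circuit.

```python
WIDTH = 8
MASK = (1 << WIDTH) - 1
TAPS = (7, 5, 4, 3)  # assumed feedback taps, chosen only for illustration


def shift_with_feedback(state):
    """Shift left by one bit, feeding back the XOR of the tapped bits."""
    feedback = 0
    for tap in TAPS:
        feedback ^= (state >> tap) & 1
    return ((state << 1) | feedback) & MASK


def patterns(seed, count):
    """Pseudo-random pattern generator: a deterministic sequence from a seed."""
    state = seed & MASK
    out = []
    for _ in range(count):
        out.append(state)
        state = shift_with_feedback(state)
    return out


def signature(responses):
    """Running checksum: fold each captured response into a shifted register."""
    sig = 0
    for response in responses:
        sig = shift_with_feedback(sig) ^ (response & MASK)
    return sig


def logic_under_test(x):
    """Hypothetical combinational block standing in for the real logic."""
    return (x ^ (x >> 1)) & MASK


stimulus = patterns(seed=0x5A, count=200)
good = [logic_under_test(p) for p in stimulus]
golden = signature(good)  # the known-good checksum

# Inject a single wrong bit into one captured response (cycle 120).
bad = list(good)
bad[120] ^= 0x04

print(signature(good) == golden)  # same patterns give the same checksum
print(signature(bad) == golden)   # the corrupted run no longer matches
```

In this sketch a single corrupted response always changes the final checksum, because the register update is an invertible linear map. Several errors can, however, cancel one another in a checksum of this kind, so a matching value is strong evidence of correct operation rather than proof.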

BIST has the advantage that it will work at full clock rate, unconstrained by a tester's limitations, and also that self-test is very much faster.

The problems are that fault coverage is not 100% and debugging at a detailed level is more difficult, since it is not feasible to preset the exact state of the chip.

Coverage of delay faults is incomplete, as delay faults are often due to coupling issues not always captured by the pseudo-random sequence.

Scan-Type Circuits

Here is an example of the scan methodology applied to a Rotary Clocked circuit, making use of ‘Lightspeed’ links to transmit serial data, such as scan data, faster than ordinary repeated interconnect.

[scanlatch_PCT.ps]

Features of the circuit shown above

-   -   Single-steppable (using the external step signal), probably one internal pulse in 100 clocks
        -   Run at full speed up to count N then stop and dump the state (difficult but fast method of finding the faulting cycle)
        -   Scan in a complete state (moving spots doing the sequencing at high speed)
        -   Scan out state at high speed using lightspeed link

Timing Sequence

    -   Scan in with EN_m and EN_s inactive.
        -   Q will hold previous value
            -   (Scan out: M will be sampled (old state read out) in one ½ cycle)
        -   M will be set by scan in on the next ½ cycle from moving spot register
    -   Step-and-Stop
        -   Synchronously all over the chip, CLK goes LOW (just prior to the single-step cycle)
        -   EN_s should go high now while CLK=LOW (ready for high time), which does not cause any output
        -   CLK goes HIGH; Q (slave) output begins to go valid from the data in the master (last scanned in, or last sampled from D)
        -   EN_m goes high during CLK=HIGH time (*CLK inactive), which allows the master to sample when CLK goes back low
        -   CLK goes LOW again (*CLK goes high); master is sampling the data
        -   EN_s should go low to prevent the captured data going forward on the next ½ cycle
        -   CLK goes HIGH again; master stops sampling the data
        -   EN_m should go low so that the next time the clock goes low a new sample is not taken (or else it will spoil the delay-fault test because there would be a whole new time to sample)
            -   (Unrelated possibility here of doing a virtual /n on the clock, e.g. sampling multiple times without Qs changing)
    -   Scan out/in
        -   Scan out and in can be performed now, e.g. input new vectors while getting out the old ones.
        -   Compare off-line the readout to the predicted ATPG vectors, OR perform a new step.
            -   Now go to the step again (based on universal chip-wide event)

The above will find delay faults because, if new data is loaded in, it is output fresh in a new period.

-   -   EN_m can change when CLK is high (*CLK is low)
    -   EN_s can change when CLK is low

SRAM Type Interface to the Latch Data

[fig???.ps]

Typically a scan-chain technique would be used to scan in and scan out test data to a chip (see above).

An alternative circuit proposed here uses an SRAM-type interface to the latches, giving random Read-Write access.

According to the prefabricated Rotary Clock layout technique outlined previously, latches can be arranged as Rows and Columns underneath the clock lines (latches can also be placed anywhere and wires can connect them to the nearest rotary clock lines). This Row/Col layout corresponds exactly to an SRAM layout (well known in industry) and, with modifications, the Latch storage element can be configured to work exactly like an SRAM cell. The latch shown has transistors N7 . . . N9, a single Column select line and Row select lines WRITE, READ. Data signals are also routed in metal layers different from the clock structures in a similar X/Y pattern. Row, Column and Data signals would be routed to Pads to get the signals off-chip to connect to a tester. Additionally, the chip itself (perhaps an on-chip test controller) could drive the SRAM interface to the self-test latches.

The SRAM overhead is very small: a 10×10 mm chip with 100K latches represents a 0.1 Mbit SRAM, tiny by modern standards. The same chip is likely to have 2 Mbits of cache memory on board. The overhead on wires and pins is small. The test mode does not have to provide sub-nanosecond access (unlike cache), so design is fairly straightforward. Internal control of the STOP signal and SRAM Read/Write interfaces permits arbitrary localised testing and state dumping/restoration of the latch state (perhaps to external memory), and can help facilitate power-down modes.

Random access testing solves two problems typical of Scan chain methods:

-   1. Excessive power from scan-chain activity is eliminated (scan chains usually cause excessive power consumption because all logic items on a chip will be activated by the shifted data).
-   2. Testing bandwidth is improved relative to scan-chain because the SRAM testing interface is inherently parallel (low-speed parallel testers can achieve higher throughput).

N-Count Test Mode:

Whether a Scan or SRAM interface is used, taking a snapshot of and then dumping the state of the machine enables very powerful diagnostics.

One such scheme practiced in Industry is binary-search testing.

In this mode, the state of the machine (state of all storage elements) is initialised (either Reset or Preset with scan-in vectors). Then N clock cycles are issued, which move the machine on to the Nth cycle.

The state is dumped externally and compared to the state predicted by a simulator which is emulating the hardware. If the two sets of state data do not match then a logical operation has failed somewhere in the N cycles. The test is repeated from the same initial state but with N/2 cycles, and the state is compared to the state predicted by the simulator after N/2 cycles. The next test might be N/4 or N*¾ depending on the results of each compare. Very quickly the exact clock cycle which caused the fault is determined.
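The search just described can be sketched as a conventional bisection. The function and parameter names below are hypothetical: `run_chip(n)` stands for "initialise, issue n clock cycles, STOP and dump the state" and `run_simulator(n)` for the predicted state after n cycles. The sketch also assumes that the two states match at cycle 0 and that, once they have diverged, they remain different on all later cycles.

```python
def first_faulting_cycle(run_chip, run_simulator, n_max):
    """Return the first cycle count at which the chip state differs from the
    simulated state, or None if they still agree after n_max cycles."""
    if run_chip(n_max) == run_simulator(n_max):
        return None
    # Invariant: states agree after `lo` cycles and differ after `hi` cycles.
    lo, hi = 0, n_max
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if run_chip(mid) == run_simulator(mid):
            lo = mid   # fault lies later: search the upper half
        else:
            hi = mid   # fault already present: search the lower half
    return hi


# Toy stand-ins for the tester and the simulator (illustrative only):
# the "chip" computes the wrong state from cycle 37 onwards.
def simulated_state(n):
    return n

def chip_state(n):
    return n if n < 37 else n + 1

print(first_faulting_cycle(chip_state, simulated_state, n_max=100))
```

Each pass of the loop corresponds to one run from the same initial state, so the number of runs grows only with the logarithm of N.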

The drawing [testchip4.ps???] shows an external counter used to drive an on-chip STOP signal after N counts using the global synchronisation of lower-rate events detailed previously in this text.

The ‘STOP’ signal is given to the chip after counting N events.

Obviously the /N counter could also be internal on a production chip.

The global synchronisation circuitry method [global_synch_system.ps ???] could be employed: one of the control inputs shown could be the ‘STOP’ signal, which the circuitry shown could transfer over the chip. For the N-cycle-then-stop signal input, latency can be used in the same way. There may be Y cycles of latency on-chip in the N-cycle-then-Stop scheme (say 8 cycles delay) for the STOP, but if the tester enters N-Y instead of N as the number to the register shown on [global_synch_system.ps ???], stoppage will occur on the correct cycle.
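The latency compensation amounts to a single subtraction, shown here as a small sketch. The function name and the example figures (a target of 1000 cycles with the 8-cycle latency mentioned above) are illustrative assumptions.

```python
def counter_value_for_stop(target_cycle, stop_latency):
    """Value the tester loads so that STOP takes effect on target_cycle.

    target_cycle: the cycle N on which the machine should freeze.
    stop_latency: the Y cycles taken to distribute STOP across the chip.
    """
    if stop_latency < 0 or stop_latency > target_cycle:
        raise ValueError("latency must lie between 0 and the target cycle")
    return target_cycle - stop_latency  # enter N-Y instead of N


print(counter_value_for_stop(1000, 8))
```

A target cycle smaller than the latency cannot be reached in this way, which is why the sketch rejects it.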

Power Saving Modes.

The previous Hierarchical clocking scheme outlined methods of frequency control. Previous applications showed voltage regulation and power-supply voltage changes to reduce power when idling.

This can be extended to:

-   -   Voltage scaling simultaneous with speed changes, e.g. gradually dropping frequency (smoothly) while lowering supply voltage; this could easily be achieved here. Also, if data is gated, chip voltage can be reduced to below that at which it would be logically functional, but state is not lost.

Software Flow Improvements

A common requirement when applying Rotary Clock methodology to an existing design would be to improve performance and reduce power consumption.

The existing design is most likely to be a single-phase, assumed zero (or low) skew methodology using DFF registers.

A well-known method of improving synchronous performance is to apply pipelining. Pipelining inserts storage elements between sequentially placed logic gates in a path to reduce the number of gate delays before resynchronisation.

Definition of ‘System Register’, ‘Pipeline Register’

A system register we define as one of those coming from the original DFF-synthesised circuit (before being fed into the special flow). Extra registers added to implement pipelining for the Rotary Clock flow are defined as ‘pipeline registers’.

Keeping the ‘system registers’ at the nominal ‘same-phase’ tap points on the ring means that the high-level timing analysis does not change.

Design/timing analysis using pseudo-DFF style

-   -   Design for the data changing before the clock edge (like a DFF)
        -   Benefit: transparency gives some safety factor, in that if an edge arrives late it will propagate through late, in the hope that this lateness will not accumulate downstream such that things fail.
        -   Can use standard timing analysis
    -   ‘System’ registers (not the pipeline registers) can be on the single-phase portion of the ring, say +/−2.5%=5%=10% of the loop, and might simplify timing analysis.
        -   System registers can be used as ‘reference’ points in the timing analysis engine, rather than worrying about all the delays, to help reduce explosion of the possible state/time transition graph.
        -   System registers probably correspond to the low-speed ASIC registers before Rotary-Clock pipeline elements are added (pass latches) and represent a good sign-off point of the architectural.

Choice of Synchronising Elements During Sequential Optimisation

In the flow to be outlined, the algorithm which undertakes retiming and clock scheduling will choose the appropriate device from the list above. A full DFF (or two pass-type latches back-to-back on opposite relative phasings) would be chosen for system registers (as defined above); a single Pseudo-DFF would be chosen when the hold time requirement of the pass-type latch does not cause a problem.

Both the previous choices would probably be configured for testability.

Then, along fine-grain pipeline stages, the clock-gated logic gate idea could be used when scanability is not vital. Finally, gated interconnect circuits could be inserted to normalise path delay variation (from different logic state routes through the path).

Pipelined buffer [See included material]

Why these would be used in the overall system—explain.

Misc Circuits

-   -   Wave shaping using multiphase rotary clock capacitively driving a single point [capacitor_array_waveshaping.ps]. The need arises to make a less than sharp square edge when driving an adiabatic or energy-recovering logic circuit. The aforementioned diagram gives a simple method of using multiphase tap points to create a capacitive divider effect. Using different size capacitors can tailor the waveshape. The ratio of total array capacitance vs. load (to-ground) capacitance determines the amplitude of the final wave.
    -   Phase locking between Rotary Clocks having other than 3f frequency differences. [4phase_f_lock.ps] is a partial circuit giving the general method whereby a multiphase, low-speed clock and a two-phase high-speed rotary clock can be phase locked together using logic gating. Similarities can be seen to the adiabatic frequency divider concept. Note that 2-phase, 4-phase distinctions are only geometrical connection-point wire routing issues with Rotary clock, since all ‘liquid’ phases are available on every ring.

SGIG Claim.

    -   Logic circuitry driven by Adiabatic Rotary Clock where interconnect capacitance as well as all logic capacitance becomes an extension of the Rotary Clock load and energy is therefore recycled.
    -   As above where Nfets only are used.
    -   As above where charge pump sampling cr

Lightspeed Claim.

    -   (Relates back to the first US division of the 1^(st) clock patent for data transfer mechanism)
        -   Transmission-line link with self-biased termination with ratio of supply voltage nominally same as the capacitive divisor ratio of the interconnect capacitance to VDD/VSS, thereby reducing power supply noise sensitivity.
        -   Pulsed transmission-line-drive mode to create high-frequency components only and no residual signal between bits, permitting high gain with simplifications of no precompensation.
        -   Similar claims to US division regarding linking it to Rotary clock source at both ends and knowing the phase delay down the wire and choosing possibly 1-of-4 (or more) phases at the receiver to synchronously decode.
        -   Extension to off-chip signalling using 4 phase oversampling (SERDES: did I ever write that one up?).

An aspect of the present invention teaches the provision of an Adiabatic frequency divider from Rotary Clock.

A further aspect of the present invention provides a Frequency control using distributed digital serial interface driving switched-capacitor load selection to change LC operating frequency of oscillators.

A still further aspect of the present invention provides a Combination of varactor and switched-capacitor control driven by a controller or FSM as described to cover a wide range of frequency/phase locking efficiently.

A Synchronous system design methodology (Flow) according to the present invention incorporates the following algorithms and steps:

-   -   Clock Scheduling and Retiming (sequential steps or concurrent optimisation) which guides an autoplacement step to deliver the multiphase schedule according to the optimisation on a real chip.
    -   Where synchronous repeaters, latches, or clock-gated logic gates are selected, driven by multiphase clock, to normalise path delay variation and permit more aggressive timing budgets.
    -   A still further aspect of the present invention provides a Logic circuitry driven by Adiabatic Rotary Clock where interconnect capacitance as well as all logic capacitance becomes an extension of the Rotary clock load and energy is therefore recycled. Preferably, Nfets only are used, and in an advantageous development charge pump sampling cr is also used.

The present invention also provides a transmission-line link with self-biased termination with ratio of supply voltage nominally same as the capacitive divisor ratio of the interconnect capacitance to VDD/VSS, thereby reducing power supply noise sensitivity, and Pulsed transmission-line-drive mode to create high-frequency components only and no residual signal between bits, permitting high gain with simplifications of no precompensation.

Advantageously, the transmission line link is linked to Rotary clock source at both ends, the phase delay down the wire being known and possibly 1-of-4 (or more) phases being chosen at the receiver to synchronously decode.

The arrangement may be extended to off-chip signalling using 4 phase oversampling.

1. A method of synchronizing a circuit comprising the steps of synchronising the circuit globally using a high-frequency clock signal, further synchronising at multiple lower frequencies by cooperative short-range state machines clocked by the high-frequency clock, and synchronising the state machines to each other by exchanging rollover signals between them.
 2. A method according to claim 1, comprising the further steps of resynchronising of low-speed, high propagation delay signals from Off-chip to create globally simultaneous signals using latency and the fact of high-frequency synchronicity coupled to the cooperative state-machines.
 3. A method according to claim 1 or claim 2, comprising the further step of phase locking between rotary structure where logical gating produces other than 3f (square-wave-harmonic-series) locking.
 4. A method according to claim 3, wherein logical gating produces 2f locking.
 5. An electronic circuit synchronized according to the method as claimed in any of the preceding claims.
 6. A circuit according to claim 3, wherein the circuit is a scan circuit having SRAM-type random access read/write method.
 7. A circuit according to claim 4, further including gated latches.
 8. An energy conserving LC clocking system having progressive simultaneous frequency and supply voltage reduction.